
Rethinking the framework constructed by counterfactual functional model.

Chao Wang¹, Linfang Liu¹, Shichao Sun², Wei Wang¹.

Abstract

The causal inference represented by counterfactual inference technology breathes new life into the current field of artificial intelligence. Although the fusion of causal inference and artificial intelligence performs well in many applications, some theoretical justifications have not been well resolved. In this paper, we focus on two fundamental issues in causal inference: the probabilistic evaluation of counterfactual queries and the assumptions used to evaluate causal effects. Both issues are closely related to counterfactual inference tasks: counterfactual queries concern the outcome of the inference task, while the assumptions provide the preconditions for performing it. A counterfactual query asks what kind of causality would arise if we artificially applied conditions contrary to the facts. In general, to obtain a unique solution, the evaluation of counterfactual queries requires the assistance of a functional model. We analyze the limitations of the original functional model when evaluating a specific query and find that the model arrives at ambiguous conclusions when the unique probability solution is 0. In the task of estimating causal effects, experiments are conducted under strong assumptions, such as treatment-unit additivity. However, such assumptions are often unsatisfiable in real-world tasks, and the assumptions themselves lack a scientific representation. To alleviate this problem, we propose a mild version of the treatment-unit additivity assumption, coined M-TUA, based on the damped vibration equation in physics. M-TUA reduces the strength of the constraints in the original assumption with a reasonable formal expression.


Keywords: Causal effect; Counterfactual approach; Functional model; Treatment-unit additivity assumption

Year: 2022 PMID: 35194320 PMCID: PMC8853228 DOI: 10.1007/s10489-022-03161-8

Source DB: PubMed Journal: Appl Intell (Dordr) ISSN: 0924-669X Impact factor: 5.019

Introduction

Counterfactual inference, as an indispensable method of causal inference, helps create human self-awareness and imbue life experiences with meaning; it is at work whenever we modify a factual prior event and then evaluate the consequences of that change [1]. In the classic Rubin causal model (RCM), counterfactual results usually refer to unobserved potential outcomes [2]. A typical application is the counterfactual query (CQ) [3], which asks what kind of causality would arise if we artificially adopted conditions contrary to the facts. Formally, the evaluation of CQs can be expressed as “If $\hat{\alpha}$ happened, would $\hat{\beta}$ have occurred?”, where $\hat{\alpha}$ is the counterfactual antecedent. CQs embody our reflections on what has already happened in the real world. For example, in Fig. 1, data released by Johns Hopkins University (JHU) show that as of August 19, 2021, EST, the cumulative number of confirmed cases of COVID-19 (coronavirus disease 2019) in the United States amounted to 37,155,209 and the cumulative number of deaths amounted to 624,253. The data also show that the cumulative number of confirmed cases in the United States accounts for about 17.75% of the more than 200 million confirmed cases worldwide, and that the cumulative number of deaths in the United States accounts for about 14.20% of the more than 4.3 million deaths worldwide. In the face of the onslaught of the epidemic, one might ponder the following query: If the U.S. had taken decisive measures, would the number of confirmed cases have been effectively controlled instead of spreading as wildly as it is now?

Fig. 1

COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at JHU

Originally, most studies on counterfactual inference (such as the query above) were confined to the field of philosophy. Philosophers establish a logical relationship that constitutes a logical world, which is consistent with the counterfactual antecedent and must be the closest to the real world (for convenience, we call this the closest world approach) [4]. Further, Ginsberg [5] applies similar counterfactual logic, based on the closest world approach, to analyze problems in AI tasks. However, the disadvantage of the closest world approach is that it lacks constraints on closeness measures. To address this issue, Balke and Pearl [3] set out to explain the closest world approach. Specifically, they suggest turning a CQ into a probability problem, namely the probabilistic evaluation of counterfactual queries (PECQs). In other words, PECQs focus on the probability of an event occurring in a specific CQ, rather than just outputting “True” or “False” (or “Yes” or “No”, etc.) for the query. PECQs motivate us to rethink counterfactual problems in many AI applications. For example, we know that COVID-19 has caused economic losses and increased unemployment in the United States [6]. An important reason is that the government has not dealt with the epidemic promptly. Based on the facts that have already occurred, we may reflect on the following question, CQ1: If the government had issued effective policies in time to control the spread of COVID-19, would the unemployment rate in the United States still have risen? Note that CQ1 contains a clear causal relationship: COVID-19 has caused the unemployment rate in the United States to rise.
Therefore, in response to CQ1, an essential task is to evaluate the degree of belief in the counterfactual consequence (i.e., probability evaluation) after considering the facts that have already happened. In other words, the task is equivalent to evaluating the probability of a potential (or counterfactual) outcome given the antecedent. Moreover, in CQ1, it is a fact that COVID-19 has swept the world and caused the unemployment rate in the United States to rise. Hence, we should focus on the following question: what is the probability that the unemployment rate in the United States would rise if there were no COVID-19? The answer undoubtedly influences government decision-making. Therefore, evaluating counterfactual queries like these has far-reaching significance for practical applications. With the widespread application of causal inference in the field of AI [7, 8], the currently popular method is to adopt the functional model (FM) [9] for inference. FM takes a CQ as input and outputs the probability evaluation of the CQ by combining prior knowledge and internal inference mechanisms. The evaluation of CQs has benefited many research fields and tasks, such as the determination of the liable person [10], marketing and economics [11], personalized policies [12], medical imaging analysis [13, 14], Bayesian networks [7], high dimensional data analysis [15], abduction reasoning [16], the intervention of tabular data [8], epidemiology [17], natural language processing (NLP) [18, 19] and graph neural networks (GNN) [20, 21]. In particular, FM can provide powerful interpretability for machine learning model decisions [22-25], which is one of the issues of greatest concern in the Artificial Intelligence (AI) community today.

Motivation

Judea Pearl discusses the limitations of current machine learning theory and points out that current machine learning models can hardly serve as the basis for strong AI [9]. An important reason is that the current machine learning approach is almost entirely statistical or “black box” in form, which brings serious theoretical limitations to its performance [26]. For example, it is difficult for current smart devices to make counterfactual inferences. A large number of researchers are increasingly interested in combining counterfactual inference with AI [27, 28], for instance in explaining consumer behavior [29], studying viral pathogenesis [30], and predicting the risk of flight delays [31]. In addition, counterfactual inference has shown advantages in improving model robustness [32, 33] and in optimizing text generation tasks [34] and classification tasks [35]. Although counterfactual inference has set off a new upsurge in the field of machine learning, a deeper understanding of the existing models and methods is notably lacking. In our work, we focus on two basic aspects of the counterfactual inference task. The first aspect concerns the counterfactual framework and is related to the inference results of the model. The second aspect concerns the preconditions for counterfactual inference tasks. Specifically, the first aspect is based on a type of counterfactual approach in causal science (e.g., the functional model): we analyze the credibility of some results obtained by using this approach to evaluate CQs. The second aspect concerns the assumptions used in causal inference to estimate causal effects. Causal effects depend on the potential outcomes; however, we cannot observe all the potential outcomes of an experimental individual simultaneously (unobservable outcomes are usually called counterfactual outcomes).
Therefore, some assumptions are often needed when estimating the causal effect. We pay attention to a commonly used strong assumption, the Treatment-Unit Additivity (TUA) assumption, and weaken it using some mathematical methods. Next, we narrow these two aspects down to the following two issues; Fig. 2 uses a real inference task (i.e., PECQs) as an example to explain the relationship between them.

Fig. 2

The framework of the probabilistic evaluation of counterfactual queries: the two issues arise in the same inference task but are independent of each other. For a given counterfactual inference task, the plausibility of the output affects the user’s confidence, and the strong assumptions taken as premises determine the scope of the task

In CQ tasks, although the output of the FM is unique, this unique solution is sometimes ambiguous. For example, when the FM evaluates the probability of a CQ and predicts that it is 0, the result may be ambiguous: although the predicted probability is 0, it is still possible that the event will happen. Intuitively, statistical uncertainty may cause ambiguity in the inference results. Dawid [36] proves that even if the statistical uncertainty can be eliminated, the inference may still be ambiguous. Therefore, when ambiguity cannot be eliminated, we must consider what may cause it and how to avoid the trouble it brings.

The assumptions used to estimate causal effects from data are strong and are often violated in real-world applications. Some strong assumptions constrain individuals (e.g., individuals u in an experimental population $\mathcal{U}$) to obtain an ideal experimental environment. This neglects obtaining an equivalent form of the assumption directly at the abstract level (e.g., the experimental population $\mathcal{U}$, or the dataset itself). In some practical applications of causal inference, researchers are required to make causal inferences in the absence of data. For example, in RCM, the causal effect is described as $O_{t,u} - O_{c,u}$, where $O_{c,u}$ ($O_{t,u}$) is the outcome variable O displayed by subject (or individual) u under the control (c) group or treatment (t) group. Unfortunately, we cannot obtain $O_{t,u}$ and $O_{c,u}$ at the same time, no matter how large the dataset is. This situation is called the fundamental problem of causal inference (FPCI) [37].

Owing to the existence of FPCI, we can only avoid it by imposing additional assumptions on the data distribution. Some typical assumptions are:

Stable Unit Treatment Value Assumption (SUTVA) [38], where each $O_{d,u}$ of u is treated as an independent event;

Assumption of Homogeneity (AOH) [39], which requires that, for any individuals $u_i$ and $u_j$ and any intervention method d, $O_{d,u_i} = O_{d,u_j}$ always holds;

Treatment-Unit Additivity (TUA) [36], which some studies also call the assumption of constant effect (AOCE). The TUA assumption imposes the equivalence that, under a defined intervention method, the causal effect is the same for every individual, i.e., $\epsilon(u_1) = \epsilon(u_2) = \dots = \epsilon(u_{\|\mathcal{U}\|})$, where 𝜖(u) denotes the individual causal effect of $u \in \mathcal{U}$ and $\|\mathcal{U}\|$ is the cardinality of the set $\mathcal{U}$.

Apparently, AOH is stronger than TUA. Therefore, for the second aspect, we focus on TUA, aiming to obtain a milder version of it.

To address the two issues mentioned above, our contributions in this paper are three-fold. We focus on a basic problem in the FM and primarily analyze the evaluation method of [3]. We find that the FM sometimes produces ambiguous outputs for some CQs, even if the final output is unique. An important reason is that, when estimating the output probability, the FM needs to compute the intersection of two sets to get the final result, and for some special CQs this intersection may be the empty set ∅. We provide a mild TUA assumption, called M-TUA, which incorporates the idea of the damped vibration equation. We prove theoretically that M-TUA can be applied to large datasets, and give a reasonable and rigorous mathematical description of this theory (see Theorem 1). In particular, for complex internal principles, we do not resort to the “black box” method; instead, we use M-TUA to try to reveal the complex internal relationship between certain parameters and the assumptions, and to describe and explain it reasonably.

Paper organization

The rest of this paper is organized as follows: In Section 2, we give the mathematical notation and their descriptions. In Section 3, we give a visualization of the FM inference mechanism and analyze the pitfalls of this inference mechanism based on concrete examples. In Section 4 and Section 5, we give a mild version of the TUA assumption (i.e., M-TUA), and theoretically prove the equivalent representation of the TUA assumption in the vector space and analyze the rationality and limitations of M-TUA. The comparison between TUA and M-TUA is in Section 6. Section 7 summarizes this paper.

Notation

In this section, the key mathematical notations and their descriptions are listed in Table 1.

Table 1

Key Notations and Descriptions

Notation	Description
∅	the empty set
$\mathcal{R}=\{R_{1},\dots,R_{n}\}$	the set of variables R_i
r_i / $\hat{r}_{i}$	the value of R_i in the real/counterfactual world
{t,c}	t and c represent two different treatments (or intervention variables)
$\mathcal{U}=U_{t}\cup U_{c}$	a population with a huge number of units u_i
$U_{t}=\{u^{t}_{1},\dots,u^{t}_{k}\}$	the set of some units receiving treatment t
$U_{c}=\{u^{c}_{1},\dots,u^{c}_{k'}\}$	the set of other units receiving treatment c, i.e., U_t ∩ U_c = ∅
$\mathbb{R}$, $\mathbb{Z}^{+}$, $\mathbb{C}$	the sets of real numbers, positive integers, and complex numbers
$A^{*}\in\mathbb{C}$	the complex conjugate of $A\in\mathbb{C}$
$\|\mathcal{S}\|$	the cardinality of a finite set $\mathcal{S}$, e.g., $\|\mathcal{R}\|=n$, $\|U_{t}\|=k$ and $\|U_{c}\|=k'$
{⋅}_n	the finite set containing n elements, e.g., $\mathcal{R}=\{R_{1},\dots,R_{n}\}=\{R_{i}\}_{n}$
c_n	all unknown factors that may influence β in the inference mechanism of FM
$\Pr(c_{n})$	the probability distribution of c_n in the inference mechanism of FM
L_ao	the Euclidean distance from point a to point o in the coordinate system
$\triangleq$	$p(x)\triangleq q(x)$ means that function p(x) is equivalent to function q(x)


Inference mechanism and result credibility analysis of FM

In this section, we first introduce the definition of PECQs [3], which is a probabilistic description of the counterfactual query. Second, we review the inference mechanism of FM in Fig. 3. Finally, we analyze the inference mechanism of FM through several examples and find that when the probabilistic evaluation of a CQ is 0, the result gives unreliable guidance for decision-making.

Fig. 3

The inference mechanism of FM when evaluating the CQ1

Definition of PECQs

Definition 1

(Probabilistic Evaluation of Counterfactual Queries, PECQs [3]) The core idea of PECQs is to transform a CQ into a probabilistic evaluation problem, formalized as (2), where “|(α0,β0)” represents the evidence (or observed data) that we have observed in the real world, and the value of the evidence can be considered as a conditional probability. $\hat{\beta}$ is the counterfactual outcome that we need to infer based on the evidence. The probabilistic evaluation of (2) can be obtained by the inference mechanism of FM [3] (i.e., Fig. 3).

Example 1

CQ1 can be translated into (2) for evaluation. Specifically, for (α0,β0), we observe that there is an ineffective policy (i.e., α0) that causes the unemployment rate to rise (i.e., β0); (2) then indicates the probability that the unemployment rate falls (i.e., $\hat{\beta}$) if we implement effective policies (i.e., $\hat{\alpha}$).

Inference mechanism of FM

The inference mechanism of FM is shown in Fig. 3. It is described in more detail in [3], and we do not repeat it here.

Analysis of the inference mechanism of FM

Although FM can output a unique solution for a CQ, we find that the results are not credible when the probability estimate output by FM is 0. In other words, an output value of 0 does not mean that the event will not occur. Next, we introduce some simple examples to reveal the untrustworthy guidance that this ambiguity may bring to decision-making.

Example 2

CQ2 [36]: The patient has a headache. Will it help if the patient takes aspirin? The information we observe is that the patient currently has a headache (denoted as β0) and is not taking aspirin (denoted as α0). Therefore, the probability evaluation of CQ2 is an instance of (2) (a query of the form of CQ2 can also be called the effects of causes [36]). However, consider a situation (a variant of CQ2, abbreviated as V-CQ2) where the patient still does not take aspirin. What is the probability of the headache disappearing? This is again an instance of (2), now with the counterfactual antecedent equal to the observed α0. If we still choose to use FM to estimate this query, we first determine the values of n in c_n according to the evidence (α0,β0), and then determine the new values of n according to the counterfactual world. Finally, the evaluation of V-CQ2 is obtained by summing the corresponding probabilities, which gives 0 (3).

Why is the evaluation of V-CQ2 equal to 0, and what does this mean?

1) When using FM to estimate the results of CQ1 and V-CQ2, a key step is to calculate the intersection of two sets of values of n: the set determined by the observed evidence in the real world, and the set updated in the counterfactual world. For example, both sets for CQ1 can be derived from Fig. 3, and the probabilistic evaluation of CQ1 is uniquely determined by their intersection. However, unlike CQ1, one of the two sets in V-CQ2 is {3,4}, which causes the probability evaluation of V-CQ2 to be 0 (i.e., (3)), because the intersection is ∅. This probability estimate is not completely credible. The reason is that we cannot be sure whether the output is derived from real predictive inference or from the way the inference mechanism processes some special counterfactual queries (e.g., V-CQ2). Therefore, when the probabilistic evaluation of a CQ is 0, a decision based on this result is not credible; that is, the result is ambiguous.

2) In addition, in V-CQ2, α0 does not constitute a counterfactual condition; it still belongs to the assumptions in the real world. In this case, the outcome is also known evidence in the real world. Hence, we obtain a value that contradicts the result of (3). This shows that α0 does not constitute an intervention that affects the outcome of the counterfactual world. Therefore, the estimated value of (3) obtained by FM violates the counterfactual consistency rule [40].
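The role of the intersection can be illustrated with a minimal sketch. It assumes a binary antecedent and a binary outcome, so that the unknown factors c_n reduce to four response functions; the numbering 1 to 4, the 0/1 encoding, the uniform prior, and all function names are assumptions made for illustration and are not taken from [3] or Fig. 3.

```python
from fractions import Fraction

# Hypothetical response functions: index n -> mapping antecedent -> outcome.
# Assumed encoding: alpha 0 = no aspirin, 1 = aspirin;
#                   beta  0 = no headache, 1 = headache.
RESPONSE = {
    1: {0: 0, 1: 0},  # outcome is always 0
    2: {0: 0, 1: 1},  # outcome copies the antecedent
    3: {0: 1, 1: 0},  # outcome negates the antecedent
    4: {0: 1, 1: 1},  # outcome is always 1
}

def consistent(alpha, beta):
    """Indices n whose response function maps alpha to beta."""
    return {n for n, f in RESPONSE.items() if f[alpha] == beta}

def evaluate(prior, evidence, query):
    """Return (probability, intersection) for a counterfactual query.

    evidence = (alpha0, beta0) observed in the real world;
    query = (alpha_hat, beta_hat) posed in the counterfactual world.
    """
    real = consistent(*evidence)          # set fixed by the evidence
    counterfactual = consistent(*query)   # set fixed by the query
    common = real & counterfactual
    norm = sum(prior[n] for n in real)
    prob = sum((prior[n] for n in common), Fraction(0)) / norm
    return prob, common

prior = {n: Fraction(1, 4) for n in RESPONSE}  # assumed uniform Pr(c_n)

# CQ2: headache without aspirin; would aspirin have removed the headache?
p_cq2, s_cq2 = evaluate(prior, evidence=(0, 1), query=(1, 0))
# V-CQ2: same evidence, antecedent unchanged, headache disappears.
p_vcq2, s_vcq2 = evaluate(prior, evidence=(0, 1), query=(0, 0))
print(p_cq2, s_cq2)
print(p_vcq2, s_vcq2)
```

Under these assumptions the evidence set is {3,4}; the V-CQ2 query selects a disjoint set, so the intersection is empty and the returned probability is 0 regardless of the prior, which is the situation the text describes as ambiguous.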

The impact of ambiguity in inference results on decision-making

We discuss the impact of the unique solution on decision-making by two examples as follows:

Example 3

When the predicted probability of an earthquake at a certain location is 0.8 or 0.9, the difference between the two values matters little for decision-making. However, when the probability of an earthquake is estimated to be 0 and this solution is unique, it is essential to verify its rationality, because the estimate directly determines whether the corresponding deployment is needed. In other words, how confident can we be, based on the prediction of FM, that there will be no earthquake? Therefore, the fact that there exist queries that cannot be answered using FM does not mean that the evaluation of these queries is meaningless.

Example 4

CQ3: The murderer assassinated President Kennedy; if the assassination had failed, would Kennedy still be alive? Formally, if the shot hits the target (α0), the target dies (β0) with a high probability (p0), and we estimate the probability of the counterfactual outcome. Using FM, we will eventually obtain 0 (the prediction process is similar to that for V-CQ2). Obviously, if the assassination had failed (that is, the shot was successfully fired but did not cause the target to die) and Kennedy were still alive, this situation might affect the assassin’s further decisions and deployment. For Kennedy’s team, it might affect the deployment of security measures for similar activities. Therefore, when the estimated result of a CQ is 0, the result cannot provide credible and sufficient guidance for decision-making.

A straightforward solution

The above analyses show that when the probability of a CQ is evaluated as 0, further verification and analysis are indispensable, because the inference mechanism of FM itself inevitably introduces ambiguity into an evaluation result of 0. Since FM determines the final output through the intersection of two sets, there is a certain probability that the intersection is the empty set. A straightforward solution is to stop using FM for estimation whenever an empty set appears in the estimation process, because the above analysis shows that we cannot interpret the empty set as a probability of 0. When this happens, we should estimate the output probability in the real world instead of the counterfactual world to avoid ambiguous results. In this case, the empty set ∅ serves as a prompt to switch the prediction strategy. Therefore, to comply with the counterfactual consistency rule, we must use the prior probability (4) (i.e., 1 − p0) to replace the FM output of 0.
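The replacement strategy can be sketched as follows. The function name, the argument layout, and the numeric values are hypothetical; the only behaviour taken from the text is that an empty intersection triggers a switch from the FM estimate to the prior probability 1 − p0.

```python
def evaluate_with_fallback(real_set, counterfactual_set, posterior, p0):
    """Evaluate a CQ, falling back to the prior on an empty intersection.

    real_set / counterfactual_set: index sets of c_n determined in the
    real and in the counterfactual world (hypothetical inputs);
    posterior: dict n -> Pr(c_n | evidence);
    p0: probability of the observed outcome in the real world.
    Returns (probability, source), where source names the rule used.
    """
    common = real_set & counterfactual_set
    if not common:
        # Empty set: do not report 0; use the real-world prior instead.
        return 1.0 - p0, "prior"
    return sum(posterior[n] for n in common), "fm"

posterior = {3: 0.5, 4: 0.5}  # made-up values for illustration
print(evaluate_with_fallback({3, 4}, {1, 3}, posterior, p0=0.9))
print(evaluate_with_fallback({3, 4}, {1, 2}, posterior, p0=0.9))
```

Returning the source label alongside the probability keeps the two cases distinguishable for a downstream decision-maker, which is the point of the argument above: a reported value should reveal whether it came from the inference mechanism or from the substitute rule.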

The mild treatment-unit additivity assumption

For the second reflection in Fig. 2, in this section, we analyze the TUA assumption, which is often used as a strong prerequisite for estimating causal effects in data. We first review the potential outcome framework (Section 4.1), the individual causal effect (Section 4.2), and the definition of TUA (Section 4.3), and provide an equivalent description of TUA by means of vectorization (Section 4.4). Second, based on the idea of the Damped Vibration Equation (DVE) [41], we propose a mild TUA assumption, called M-TUA (Section 4.5). M-TUA not only weakens the original assumption but also has good mathematical properties and interpretability. Our main conclusion in this section rests on two lemmas, and the proof is divided into two steps. First, we describe the relationship between TUA and ICE in the counterfactual approach, and we explore the equivalence of ICE and the residual causal effect (RCE) under the TUA assumption (i.e., Lemma 1). Second, we introduce the definitions of positive effects and negative effects, and on this basis we obtain the equivalent form of TUA in vector space by Lemma 2.

Potential outcome framework

According to the viewpoint of Rubin [42], causal inference involves an intervention: there is no cause and effect without intervention, and each intervention state corresponds to one potential outcome. When an intervention state is realized, we can only observe the potential outcome in the realized state; that is, we cannot observe the potential outcomes (i.e., counterfactual outcomes) in the counterfactual world (see Example 6). This situation, in which all potential outcomes of a unit cannot be observed simultaneously, is the FPCI mentioned earlier. Formally, for binary intervention variables, let d ∈ {t = 1, c = 0}; the observed outcome $O_{u}$ can be expressed in terms of the potential outcomes as $O_{u} = d\,O_{t,u} + (1-d)\,O_{c,u}$, where $O_{d,u}$ represents the potential outcome of treatment d ∈ {t,c} on unit $u \in \mathcal{U}$. For a more intuitive description, we focus on a 2-dimensional Gaussian distribution model. Specifically, we introduce the following example [36] and use it as the basic background for subsequent analysis.
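With d ∈ {t = 1, c = 0}, a common way to write the observed outcome is O = d·O_t + (1 − d)·O_c. The sketch below assumes this form; the class name, field names, and unit values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    o_t: float  # potential outcome under treatment t (d = 1)
    o_c: float  # potential outcome under control c (d = 0)
    d: int      # realized intervention state, 1 or 0

def observed(unit):
    """Potential outcome of the realized state: the only one we see."""
    return unit.d * unit.o_t + (1 - unit.d) * unit.o_c

def counterfactual(unit):
    """Potential outcome of the other state: never observed (FPCI)."""
    return (1 - unit.d) * unit.o_t + unit.d * unit.o_c

units = [Unit(o_t=5.0, o_c=3.0, d=1), Unit(o_t=4.0, o_c=4.5, d=0)]
for u in units:
    print(observed(u), counterfactual(u))
```

Both potential outcomes are stored here only because the data are synthetic; in a real dataset the value returned by `counterfactual` is exactly what is missing.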

Example 5

Given the pair $(O_{t,u}, O_{c,u})$, the pairs for different units u are independent and identically distributed (i.i.d.), each with the 2-dimensional Gaussian distribution with means (μ_t, μ_c), σ_t = σ_c = σ (for simplicity of calculation, we assume that the distribution has a common variance σ), and correlation ρ ∈ (0,1). Furthermore, we use the mixed model (7) to describe the specific structure, in which $O_{d,u}$ is the sum of three terms. The first term, μ_d, indicates the treatment effect applicable to all units. The second term represents the effect on unit u, called the unit effect, which applies under all treatments. The third term stands for the effect between treatment and unit, called the unit-treatment interaction; this internal mechanism reveals the change from one treatment to another for unit u. The unit effect and the unit-treatment interaction are independent random variables.
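A small simulation can make the structure of this example concrete. The sketch assumes an additive decomposition with a unit effect of variance ρσ and two independent unit-treatment interactions of variance (1 − ρ)σ each, where σ is the common variance; this particular split, the function names, and the parameter values are assumptions chosen so that each outcome has variance σ and the pair has correlation ρ.

```python
import math
import random

def simulate_pairs(n, mu_t, mu_c, var, rho, seed=0):
    """Draw n pairs (O_t, O_c) from the assumed additive mixed model."""
    rng = random.Random(seed)
    sd_unit = math.sqrt(rho * var)           # unit effect, shared by t and c
    sd_inter = math.sqrt((1.0 - rho) * var)  # unit-treatment interaction
    pairs = []
    for _ in range(n):
        unit_effect = rng.gauss(0.0, sd_unit)
        o_t = mu_t + unit_effect + rng.gauss(0.0, sd_inter)
        o_c = mu_c + unit_effect + rng.gauss(0.0, sd_inter)
        pairs.append((o_t, o_c))
    return pairs

def correlation(pairs):
    """Sample Pearson correlation between the two coordinates."""
    n = len(pairs)
    mean_t = sum(p[0] for p in pairs) / n
    mean_c = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_t) * (p[1] - mean_c) for p in pairs)
    var_t = sum((p[0] - mean_t) ** 2 for p in pairs)
    var_c = sum((p[1] - mean_c) ** 2 for p in pairs)
    return cov / math.sqrt(var_t * var_c)

pairs = simulate_pairs(20000, mu_t=2.0, mu_c=1.0, var=1.0, rho=0.5)
print(round(correlation(pairs), 2))
```

In a real study only one coordinate of each pair would be observed, so the sample correlation computed here is available only because the data are simulated.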

Individual causal effect

Dawid [36] adopts the model of (7) to analyze the pros and cons of the counterfactual approach from a decision-making perspective and mentions an assumption that is often used in counterfactual analysis, called TUA (Definition 2). Because the TUA assumption places strong constraints on the data, its practicability and scope of use are reduced. Hence, another goal of our study is to design a mild TUA assumption that constrains the dataset itself or the experimental population as a whole, rather than imposing a strong constraint on each individual as the traditional TUA assumption does. In the rest of this section, we try to optimize TUA so that it has a broader scope of application in the context of large data. Specifically, we first analyze the individual and average causal effects based on (7). In an experimental study, the individual causal effect (ICE) is the basic object (or basic measure). It describes the differences among the potential outcomes of a given unit under all possible treatments d ∈ {t,c}. Generally, for one unit $u \in \mathcal{U}$, the ICE can be represented as $\epsilon(u) = O_{t,u} - O_{c,u}$ (8). For different tasks, the ICE can also take other forms. Therefore, from a broader perspective, the subtraction in the definition of ICE is not necessarily a subtraction in ℝ. Note that no matter which form is used, only one potential outcome can be observed [43]. Researchers usually do not pay attention to ICE directly, but focus on the average of the causal effects of all units, that is, ACE, also known as the average treatment effect (ATE). ACE can be expressed as $\epsilon_{ACE}(u) = \mathbb{E}[O_{t,u} - O_{c,u}]$ (9). Apparently, in (7), 𝜖ACE(u) = μ_t − μ_c.
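As a worked illustration, ICE, ACE, and a check of the constant-effect condition behind TUA can be computed on a toy table in which both potential outcomes are known. The sketch takes ICE as the difference between the treatment and control outcomes and ACE as its average over units; the outcome values and function names are hypothetical.

```python
def ice(o_t, o_c):
    """Individual causal effect of one unit: treatment minus control."""
    return o_t - o_c

def ace(table):
    """Average causal effect: mean of the ICEs over all units."""
    effects = [ice(o_t, o_c) for o_t, o_c in table]
    return sum(effects) / len(effects)

def satisfies_tua(table, tol=1e-9):
    """True if every unit has the same ICE (constant effect)."""
    effects = [ice(o_t, o_c) for o_t, o_c in table]
    return max(effects) - min(effects) <= tol

# Hypothetical (O_t, O_c) pairs for four units; in reality only one
# entry of each pair would be observed.
table = [(5.0, 3.0), (6.0, 4.0), (4.0, 2.0), (7.0, 6.0)]
print([ice(o_t, o_c) for o_t, o_c in table])
print(ace(table))
print(satisfies_tua(table), satisfies_tua(table[:3]))
```

The last unit breaks the constant-effect condition, which shows how a single individual can violate an assumption that constrains every unit separately.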

Limitations of the counterfactual approach focused on ICE

We use Example 5 for our analysis. Specifically, according to (7) and (8), we have 𝜖(u) = 𝜖ACE(u) + $\Delta(\lambda_{d,u})$, where $\Delta(\lambda_{d,u}) = \lambda_{t,u} - \lambda_{c,u}$ is called the residual causal effect (RCE) [36]. It is easy to verify that the RCE has mean zero. Thus, according to (7)-(9), we obtain the distribution of ICE, (11), as 𝜖(u) ∼ N(𝜖ACE(u), 2(1 − ρ)σ). However, in (11), 2(1 − ρ)σ cannot be inferred from observed data, no matter how large the dataset is. The reason is that even if the marginal distributions of $O_{t,u}$ and $O_{c,u}$ are known, the joint distribution of the two random variables cannot be determined, because the marginal distributions of the Gaussian distribution do not depend on the parameter ρ. Moreover, (12), which follows from (7), indicates that different values of ρ determine different variances of the distribution of 𝜖(u). We can only obtain a range for the variance, and each ρ leads to a different value, which makes the reasoning results uncertain. For example, we can use (11) to estimate the ICE of a new unit u, because inferring 𝜖(u) is equivalent to inferring 𝜖ACE(u) and 2(1 − ρ)σ under (11). Unfortunately, we cannot accurately determine the value of 2(1 − ρ)σ.
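
The non-identifiability argument above can be illustrated numerically. The following is a minimal sketch, assuming the reading in which σ is the common variance of both potential outcomes; the function names `ice_variance` and `simulate_pairs`, the means 11 and 10, and the sample size are illustrative choices that do not come from the paper. For several values of ρ the two marginal variances stay near σ, while the variance of the ICE changes with ρ as 2(1 − ρ)σ.

```python
import random
from statistics import pvariance


def ice_variance(sigma: float, rho: float) -> float:
    """Variance of O_t - O_c for common variance sigma and correlation rho."""
    # Var(O_t - O_c) = sigma + sigma - 2 * rho * sigma
    return 2.0 * (1.0 - rho) * sigma


def simulate_pairs(mu_t, mu_c, sigma, rho, n, seed=0):
    """Draw n correlated Gaussian pairs (O_t, O_c); sigma is the common variance."""
    rng = random.Random(seed)
    sd = sigma ** 0.5
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        o_t = mu_t + sd * z1
        # Mixing z1 and z2 gives o_c unit-scaled variance and correlation rho with o_t.
        o_c = mu_c + sd * (rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2)
        pairs.append((o_t, o_c))
    return pairs


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        pairs = simulate_pairs(11.0, 10.0, 1.0, rho, 20000, seed=1)
        var_t = pvariance([t for t, _ in pairs])
        var_c = pvariance([c for _, c in pairs])
        var_ice = pvariance([t - c for t, c in pairs])
        print(f"rho={rho}: Var(O_t)={var_t:.3f} Var(O_c)={var_c:.3f} "
              f"Var(ICE)={var_ice:.3f} theory={ice_variance(1.0, rho):.3f}")
```

In a real study each unit reveals only one coordinate of the pair, so only the two marginal columns are available, which is why the data carry no information about ρ.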

Example 6

(Calculation of causal effect parameters (i.e., ICE, ACE) in the ideal case). In Table 2, we construct a simple example to demonstrate the calculation of the causal effect parameters, such as ICE and ACE. Suppose a population contains four subjects, labeled u1, u2, u3 and u4. For each u, the potential outcomes in both intervention states are known (in reality only one potential outcome can be observed). Individuals 1 and 2 are in the intervention group (i.e., the set of units receiving treatment t), and individuals 3 and 4 are in the control group (i.e., the set of units receiving treatment c).

Table 2

Causal effect parameters

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}$	d	$O_{t,u_i}-O_{c,u_i}$
u₁	30	0	30	t	30
u₂	10	10	10	t	0
u₃	10	0	0	c	10
u₄	10	10	10	c	0

According to Table 2, we can obtain 𝜖ACE(u) = (30 + 0 + 10 + 0)/4 = 10. Meanwhile, based on the information in Table 2, we can further obtain two other causal effect parameters: the average treatment effect for the treated (ATT) and the average treatment effect for the control (ATC), where ATT = (30 + 0)/2 = 15 and ATC = (10 + 0)/2 = 5. Unfortunately, in the real world, the counterfactual entries in Table 2 (e.g., $O_{c,u_1}$, $O_{c,u_2}$) are not observable. The reason is that, because subject u2 receives treatment d = t, we cannot observe the potential outcome of u2 under treatment d = c at the same time. Therefore, in the real world, the calculation and estimation of the causal effect parameters require additional constraints (e.g., the treatment-unit additivity assumption (Definition 2)) to be imposed on the data.
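
The arithmetic behind Table 2 can be sketched in a few lines. This is only an illustration of the ideal case in which both potential outcomes of every subject are known; the dictionary keys and the helper name `ice` are hypothetical.

```python
from statistics import mean

# Table 2 in the ideal case: both potential outcomes are known for every subject.
table2 = [
    {"unit": "u1", "o_t": 30, "o_c": 0, "d": "t"},
    {"unit": "u2", "o_t": 10, "o_c": 10, "d": "t"},
    {"unit": "u3", "o_t": 10, "o_c": 0, "d": "c"},
    {"unit": "u4", "o_t": 10, "o_c": 10, "d": "c"},
]


def ice(row):
    """Individual causal effect O_t - O_c of one subject."""
    return row["o_t"] - row["o_c"]


ace = mean(ice(r) for r in table2)                   # average over all subjects
att = mean(ice(r) for r in table2 if r["d"] == "t")  # average over the treated
atc = mean(ice(r) for r in table2 if r["d"] == "c")  # average over the controls

print(ace, att, atc)
```

With real data the `o_c` entries of treated subjects and the `o_t` entries of control subjects would be missing, so none of the three averages could be computed this way.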

Treatment-unit additivity

In summary, the POF focuses on the inference of causal effects but does not explain the mechanism of influence between variables [44]. A computational bottleneck is the prediction of parameter ρ through the marginal distribution. Therefore, in the task of using the causal model for inference, additional constraints (e.g., Example 7) are usually required to ensure that the inference result is obtained under this constraint.

Example 7

Under the TUA, 𝜖(u) = 𝜖ACE(u) implies that ρ = 1.

Definition 2

(Treatment-Unit Additivity (TUA) [36]). The TUA assumption deals with the non-uniformity of data through a strong constraint. Specifically, TUA requires that 𝜖(u) takes the same value for all units, e.g., $\epsilon(u_i) = \epsilon(u_j)$ for any two units $u_i$ and $u_j$. TUA can be equivalently regarded as the assumption of constant effect (AOCE). For example, we can set $\epsilon(u_i) = \epsilon(u_j)$ = a specific constant (e.g., 𝜖ACE(u)). Generally speaking, AOCE uses the average effect in the sample to estimate the causal effect. Next, we give a simple example to demonstrate the relation between TUA and ACE and the application of TUA.

Example 8

Considering a fundamental problem of causal inference, let u1 be a patient. We want to know whether certain medication has a therapeutic effect on u1. Suppose that the data about patient u1 is shown in Table 3.

Table 3

The data of u1, where $O_{c,u_1}$ and $O_{t,u_1}-O_{c,u_1}$ are unknown

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}-O_{c,u_i}$
u₁	13	?	?

According to Table 3, we only know that $O_{t,u_1} = 13$. Due to the existence of FPCI, we cannot simultaneously observe the effects of u1 taking the medication and not taking the medication. Therefore, we rely on additional constraints (i.e., TUA) to estimate the value of $O_{c,u_1}$. Suppose we also have additional data (as shown in Table 4). We can then use the TUA assumption to infer the values of $O_{t,u_i}$ and $O_{c,u_i}$ (i = 1,2,3,4,5). For example, according to $\epsilon(u_i)$ = 𝜖ACE(u) = −1, we can obtain the following complete prediction data (see Table 5).

Table 4

Additional information about all u

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}-O_{c,u_i}$
u₁	13	?	?
u₂	?	12.5	?
u₃	10	?	?
u₄	?	13	?
u₅	?	12	?
mean	11.5	12.5	−1

Table 5

Assignment mechanism based on TUA assumption with 𝜖ACE(u) = − 1

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}-O_{c,u_i}$
u₁	13	14	−1
u₂	11.5	12.5	−1
u₃	10	11	−1
u₄	12	13	−1
u₅	11	12	−1
mean	11.5	12.5	−1
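
The step from Table 4 to Table 5 can be sketched as follows. This is a minimal illustration, assuming that the ACE is estimated as the difference between the observed group means and that TUA assigns this same effect to every unit; the function name `impute_under_tua` and the use of `None` for unobserved entries are hypothetical.

```python
from statistics import mean

# Table 4: (O_t, O_c) per unit, with None marking the unobserved potential outcome.
table4 = {
    "u1": (13, None),
    "u2": (None, 12.5),
    "u3": (10, None),
    "u4": (None, 13),
    "u5": (None, 12),
}


def impute_under_tua(observed):
    """Fill in missing potential outcomes by giving every unit the same effect."""
    treated = [o_t for o_t, _ in observed.values() if o_t is not None]
    control = [o_c for _, o_c in observed.values() if o_c is not None]
    ace = mean(treated) - mean(control)  # difference of observed group means
    completed = {}
    for unit, (o_t, o_c) in observed.items():
        if o_t is None:
            o_t = o_c + ace  # TUA: O_t = O_c + ACE
        else:
            o_c = o_t - ace  # TUA: O_c = O_t - ACE
        completed[unit] = (o_t, o_c)
    return ace, completed


ace, table5 = impute_under_tua(table4)
print(ace)
for unit, (o_t, o_c) in table5.items():
    print(unit, o_t, o_c, o_t - o_c)
```

Every completed row has the same difference, which is exactly the unit-level constraint that makes TUA a strong assumption.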


Equivalent form of TUA

TUA assumes that the causal effect 𝜖(u) is the same for all units, e.g., $\epsilon(u_i) = \epsilon(u_j)$. Unfortunately, as a commonly used prerequisite, TUA is a strong assumption, which cannot be tested on observable data and lacks a transparent explanation in the real world [36]. This leads to some interesting questions worth exploring, such as:

For applications of TUA, how can we obtain a mild version of the TUA assumption that is more broadly applicable?

For interpretability of TUA, based on the TUA assumption (or a mild TUA), how can we establish a formal expression that describes the impact of the main factors inside the data on estimating ICE?

To address these issues, we first provide an equivalent form of the TUA assumption under the 2-dimensional Gaussian distribution (i.e., Lemma 1).

Lemma 1

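If the data follows a Gaussian distribution as in Example 5, then the TUA assumption has an equivalent form expressed in terms of the RCE: for any two units $u_i$ and $u_j$, the difference $\Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_j})$ approaches 0, where q is a sufficiently large positive integer.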

Proof

Given two units $u_i$ and $u_j$, according to (7) and (8), we have (18): $\epsilon(u_i) - \epsilon(u_j) = \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_j})$. Hence, a reasonable idea based on (18) is to shift our attention from the constraint on 𝜖(u) to a constraint on the RCE Δ(λ). Note that the estimated average of 𝜖(u) will be closer to 𝜖ACE(u) if the size of the data is large enough. Therefore, 𝜖ACE(u) can be identified from a large experiment as the difference between the average responses under the two treatments. This means that the impact of Δ(λ) on the data may be related to the size of the data. Consider a group containing q units, each of which receives a treatment d. We can assign treatments through randomized controlled trials (RCT) and collect the observed outcomes $O_{t,u}$ and $O_{c,u}$. Suppose that q is a large positive integer. Then 𝜖ACE(u) ≈ $\bar{O}_t - \bar{O}_c$, where $\bar{O}_t$ represents the average of the responses of the k units receiving treatment t, and $\bar{O}_c$ is the average of the responses of the q − k units receiving treatment c. Here q, k, and q − k are all large numbers. Therefore, 𝜖ACE(u) is estimable and its estimate is close to the true value. Next, we apply the TUA constraint to (18), which is equivalent to setting $\epsilon(u_i) - \epsilon(u_j) = 0$. According to (18), it is unnecessary to constrain every λ to a fixed value if q is large enough. The alternative is to consider the difference between two Δ(λ) and formally characterize it so that it gradually approaches 0 when q is large. Therefore, in terms of the RCE, we obtain the equivalent form of the TUA assumption, which proves the lemma. □
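
The claim that the ACE becomes estimable from a large randomized experiment can be checked with a small simulation. The sketch below is an assumption-laden illustration rather than the construction used in the proof: outcomes are drawn directly from Gaussian distributions with hypothetical means 11 and 10 and unit standard deviation, k units are treated and q − k are controls, and `estimate_ace` is a made-up helper name.

```python
import random
from statistics import mean


def estimate_ace(q, k, mu_t=11.0, mu_c=10.0, sd=1.0, seed=0):
    """Difference of group means with k treated units and q - k control units."""
    rng = random.Random(seed)
    treated = [rng.gauss(mu_t, sd) for _ in range(k)]      # observed O_t values
    control = [rng.gauss(mu_c, sd) for _ in range(q - k)]  # observed O_c values
    return mean(treated) - mean(control)


if __name__ == "__main__":
    # The true ACE in this simulation is mu_t - mu_c = 1.
    for q in (10, 100, 1000, 20000):
        print(q, round(estimate_ace(q, q // 2), 4))
```

The estimate fluctuates for small q and settles near the true difference of means as q, k, and q − k grow, which is the sense in which the proof treats the ACE as estimable.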

The properties of Δ(λ) in 2-dimensional vector space

Further, we analyze the properties of TUA in 2-dimensional vector space. From the above analysis, both the TUA and the equivalent form given by Lemma 1 are only numerical constraints (e.g., $\epsilon(u_i) = \epsilon(u_j)$). In other words, neither the TUA assumption itself nor Lemma 1 reflects the internal influence on the data. To explore the internal influence of TUA on the data, our core idea is to transform the original TUA assumption from constraints on values (i.e., scalars) into constraints on vectors. Specifically, we analyze the TUA assumption by vectorizing Δ(λ) (i.e., Lemma 2) and introducing a definition of the positive and negative effects of Δ(λ) on the data (i.e., Definition 3).

Lemma 2

For any unit, let Δ+(λ) denote the positive effect of Δ(λ) on the data, and Δ−(λ) denote the negative effect of Δ(λ) on the data. Then the TUA assumption has an equivalent form in the vector space, in which the positive effects of q+ units balance the negative effects of q− units, where q+ + q− = q. Before proving Lemma 2, we need to introduce the definitions of the vectorization of Δ(λ), positive effects, and negative effects.

Definition 3

(The vectorization of Δ(λ).) Let L represent the distance from a certain point a to the point o in the coordinate system (e.g., the two distances shown in Fig. 4a). The vectorization of Δ(λ) refers to assigning the characteristics of a vector to Δ(λ) to describe its possible positive or negative effect on the data. Fig. 4b shows the vectorization for a pair of units, and Fig. 4c and d show it for the entire dataset. There is a one-to-one correspondence between positive effects and negative effects. In other words, if a positive effect “+” exists, there must be a negative effect “-” corresponding to it.

Fig. 4

Panels (a)-(d) describe the equivalent representation of the TUA in the vector space obtained by vectorizing Δ(λ). (a) is the geometric description of the traditional TUA assumption in the coordinate system. According to Lemma 1, $\epsilon(u_i) = \epsilon(u_j)$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_j})$. Hence, in the 2-dimensional plane, we can use equal Euclidean distances to describe this constraint; (b) describes the vectorization of Δ(λ) for a pair of units, following the definitions of positive (red) and negative (blue) effects and the TUA assumption; (c) describes the vectorization of Δ(λ) over the dataset. Note that the positive and negative effects of Δ(λ) on the data are almost equal when the number of samples is large enough. Since all vectorized Δ(λ) have the same magnitude, they form a circle in a 2-dimensional plane; (d) reflects the extension of the TUA assumption in the vector space. It can be regarded as a visualization of the TUA assumption at an abstract level (that is, constraints are applied to the dataset rather than to each u). In other words, it is no longer necessary that every unit has the same Δ(λ).

Rationality analysis

According to Definition 3, we transform the original TUA assumption from constraints on scalars into constraints on vectors. For example, some individuals insist on eating nuts because nuts are good for their health (i.e., a positive effect), but some people are allergic to nuts, and eating them brings pain and even life-threatening consequences (i.e., a negative effect). Therefore, we argue that it is necessary to consider the positive or negative effects of λ. Definition 3 provides an intuitive representation of positive and negative effects in the vector space. Based on this definition, we give a proof of Lemma 2 as follows. For ease of understanding, the proof refers to Fig. 4.

First, consider the representation of Δ(λ) in a 2-dimensional plane. As shown in Fig. 4a, we represent each Δ(λ) as a Euclidean distance L in the plane. According to Lemma 1, $\epsilon(u_i) = \epsilon(u_j)$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_j})$, so the equality of two such distances equivalently describes the constraint.

Second, consider the representation of the TUA in 2-dimensional vector space. According to Definition 3, we can vectorize λ. Vectorization gives each Δ(λ) a direction, which describes its positive or negative effect on the data. To maintain consistency with the original TUA assumption, we assume that the magnitudes are equal. For instance, as shown in Fig. 4b, Δ+(λ) and Δ−(λ) denote the positive and negative effects of Δ(λ) on the data; they have equal magnitude although they point in opposite directions.

Third, we extend the argument to the entire dataset. Because our research targets large datasets, we implicitly assume that, over the entire data, the positive and negative effects on data generation are basically the same. Furthermore, since all vectorized Δ(λ) have the same magnitude, we can visualize the entire data as a circle in a 2-dimensional plane. Intuitively, the balance between positive and negative effects always holds under the TUA constraint. However, this balance does not require the strong constraint that every Δ(λ) be identical. In other words, in Fig. 4-(b), it is sufficient that the area of red is the same as the area of blue. Therefore, we can relax the restriction on Δ(λ) by assuming only the balance, without the unit-level equality. In summary, we obtain the equivalent form stated in Lemma 2, which proves the lemma. □

The traditional TUA strongly constrains all Δ(λ) (or 𝜖(u)) to be the same for every unit, which ignores the effect of Δ(λ) on the data and the estimated ICE. However, ignoring this effect by applying TUA does not mean that the effect of Δ(λ) on the data does not exist. Therefore, we do not ignore this potential impact but represent it by introducing the vectorization method (i.e., positive and negative effects in Definition 3). In addition, Lemma 2 relaxes the constraint on the data to the level of the entire dataset rather than imposing a strong constraint on each unit u. Therefore, Lemma 2 can be considered an equivalent form of TUA at the abstract level.

The convergence of Δ+(λ) and Δ−(λ)

Through the above analysis, we provide the equivalent form of the TUA, which is based on the 2-dimensional Gaussian distribution and a large dataset. By performing vectorization operations on Δ(λ), we introduce the definitions of positive and negative effects, aiming to study the effect of Δ+(λ) and Δ−(λ) on the data under the premise of Lemma 2. Although we assume that the effects of Δ+(λ) and Δ−(λ) are equal in a large dataset, we hope that Δ+(λ) and Δ−(λ) will have less and less impact on the data as q approaches ∞. This concern is necessary because, if the sample size is not large enough, the positive and negative effects may not cancel each other out. For example, the positive effects may be greater than the negative effects or vice versa. Quantifying Δ+(λ) and Δ−(λ) requires rigorous and rational mathematical expressions. Therefore, a natural question is: how can we describe the convergence of Δ+(λ) and Δ−(λ) when q approaches ∞? We answer this question in Theorem 1.

The descriptive equation of Δ+(λ) and Δ−(λ)

In classical physics, damping refers to the characteristic that the amplitude of vibration gradually decreases in any oscillating system, which may be caused by external influences or the system itself [45]. We introduce this idea into the study of the descriptive equation of Δ+(λ) and Δ−(λ). In this section, we provide a descriptive equation for Δ+(λ) and Δ−(λ) under which, when q approaches ∞, Δ+(λ) and Δ−(λ) converge strictly to 0 (see Theorem 1).

Theorem 1

For any unit, suppose that Δ(λ) has a positive effect Δ+(λ) and a negative effect Δ−(λ) on the data, and that these effects satisfy (or approximately satisfy) the descriptive equation (26), where n, η+ > 0, and η− > 0 are adjustment parameters, the exponential factors governed by η+ and η− are attenuation parameters, and A+ and A− are the initial values of the positive and negative effects, respectively. Then Δ+(λ) and Δ−(λ) gradually converge to 0 as q approaches ∞. Let us analyze the first term of (26), i.e., (27), where A+ and A− are the initial values. Because η+ > 0 and η− > 0, the two exponential terms in the equation decay with the data size q. Unfortunately, if the equation only uses (27) to describe the exponential decay trend of Δ+(λ) and Δ−(λ), it cannot reflect their potential impact on the data. In other words, Δ+(λ) and Δ−(λ) do not necessarily converge along a strictly monotonically decreasing function (see Fig. 5). Therefore, we need to consider the volatility of their effect on the data.

Fig. 5

A visualization of the influence of the parameters (A+, η+) on equation (27). The situation of S(Δ−(λ)) is similar to that of S(Δ+(λ))

The influence of Δ+(λ) and Δ−(λ) on the data may be volatile. Therefore, we add an oscillatory term to (27) to describe the volatility of their effect on the data, which rewrites (27) as (26), where n, η+ > 0, and η− > 0 are adjustment parameters and the exponential factors are attenuation parameters. The exponential factor not only ensures that (27) decays exponentially but also ensures that (26) decays. According to Fig. 5, we can intuitively understand the meaning of the parameters A+ and η+ in (27). The parameter A+ determines the initial maximum value of the positive effect. The parameter η+ determines the convergence speed of the function. Although (27) describes that the positive effect converges to 0 quickly as the number of samples increases, it ignores the volatility of positive effects. The analysis for the negative effect is similar. Similarly, according to Fig. 6, we can intuitively understand the meaning of the parameters in (26). The parameter A+ determines the initial maximum value of the positive effect, the parameter η+ determines the convergence speed of the function, and the oscillatory term reflects the volatility of the positive and negative effects. The purpose of introducing the oscillatory term is to reflect the conversion between the positive effect and the negative effect as much as possible. The conversion can go either way: a positive effect can become a negative effect or vice versa. However, no matter how it is converted, the effect eventually converges strictly to 0 under the attenuation parameter. The proof for the negative effect is similar. □
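
The qualitative behaviour described above can be sketched with the textbook damped vibration form. Note the assumptions: the functional form `amplitude * exp(-eta * q) * cos(n * q + phase)`, the cosine, the `phase` argument, and the function names are illustrative and are not confirmed to match the paper's descriptive equation; only the roles of the initial value, the decay rate η, and the adjustment parameter n follow the text.

```python
import math


def envelope(q, amplitude, eta):
    """Pure exponential decay: monotone, with initial value `amplitude`."""
    return amplitude * math.exp(-eta * q)


def damped_effect(q, amplitude, eta, n, phase=0.0):
    """Exponential decay modulated by an oscillation (assumed cosine form)."""
    return envelope(q, amplitude, eta) * math.cos(n * q + phase)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to make the pattern visible.
    for q in (0, 5, 10, 20, 50, 100, 200):
        print(q, round(envelope(q, 2.0, 0.05), 6),
              round(damped_effect(q, 2.0, 0.05, 0.3), 6))
```

The oscillating curve changes sign, mirroring the conversion between positive and negative effects, while its magnitude never exceeds the exponential envelope, so it still vanishes as q grows.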

Fig. 6

A visualization of the influence of the parameters (A+, n, η+) on equation (26). The situation of S(Δ−(λ)) is similar to that of S(Δ+(λ))

The rationality analysis of equations S(Δ+(λ)) and S(Δ−(λ))

The rationality analysis of equations S(Δ+(λ)) and S(Δ−(λ)) mainly includes two aspects: one is the analysis of the visualization results of S(Δ+(λ)) and S(Δ−(λ)); the other is their interpretability.

The function of the oscillatory term

To simplify the presentation, we only analyze positive effects in this subsection. The analysis of negative effects is similar. As shown in Fig. 5, (27) only reflects exponential decay as q increases. Although (27) eventually converges to 0, it does not reflect the potential impact of Δ+(λ) on the data, because it describes the positive effect as a strictly monotonically decreasing function. A representation based on strict monotonic decrease ignores the internal complexities. The effect of Δ+(λ) on data may be volatile (the situation may also be more complex). Therefore, to describe this volatility, we introduce an oscillatory function. As a result, (26) presents a trend of exponential decay with volatility. Finally, as q increases, (26) converges strictly to zero.

Attenuation parameters

The purpose of introducing the attenuation parameter is to ensure that the positive effect and the negative effect exhibit exponential decay as q increases. Although we improve TUA by vectorization, we hope that S(Δ+(λ)) and S(Δ−(λ)) will have minimal impact on the overall data. Therefore, even while acknowledging the existence of positive and negative effects, we hope that they decay as quickly as possible in an exponential manner. In fact, according to Lemma 1, Lemma 2, and Theorem 1, we provide a milder TUA assumption (referred to as M-TUA for short) through vectorization operations. In particular, (26) provides a formal description of positive effects and negative effects, which makes M-TUA interpretable. In summary, the above conclusion provides a mild form of TUA at the abstract level and an explicit (but not unique) mathematical description.

Comparison of TUA and M-TUA

In this section, we compare the traditional TUA and M-TUA to illustrate their similarities and differences.

𝜖(u) and Δ(λ). TUA assumes that the value of ICE is the same for all units, e.g., $\epsilon(u_i)$ = 𝜖ACE(u), where i ∈ [1,...,q]. M-TUA transfers this constraint to Δ(λ) through the vectorization operation, that is, the positive effects of q+ units balance the negative effects of q− units, where q+ + q− = q.

Vector and scalar. M-TUA provides a vector description of the positive and negative effects of Δ(λ) (i.e., Δ+(λ) and Δ−(λ)), which distinguishes M-TUA from the traditional TUA. The vectorization operation allows differences between individuals, that is, $\epsilon(u_i) \neq \epsilon(u_j)$ is allowed under the premise that the effects balance over the dataset. Therefore, M-TUA weakens TUA.

Variance. For a randomized experiment, the assumption of TUA implies that the variance is constant for all treatments. Constant variance is not a necessary condition for M-TUA; instead, M-TUA should be used on data with small variance, which constrains the dispersion of the population.

For intuition, we use a simple example to further illustrate how M-TUA weakens the TUA assumption.

Example 9

(Difference between data generated by TUA and M-TUA) TUA is different from M-TUA in a number of respects. A simple goal in this example is to compare the differences in the data under different assumptions via estimating the unobserved potential outcomes from Table 6.

Table 6

Observation data with 𝜖ACE(u) = 1

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}-O_{c,u_i}$
u₁	13	?	?
u₂	?	9.5	?
u₃	?	8	?
u₄	?	10	?
u₅	11	?	?
u₆	15	?	?
u₇	?	9.5	?
u₈	9	?	?
u₉	?	10	?
u₁₀	?	9	?
mean	?	?	1

Similar to Example 8, in Table 7, we construct a set of data (including 10 subjects $u_i$, i ∈ [1,2,...,10]) that meets the TUA assumption, where $\epsilon(u_i)$ = 𝜖ACE(u) = 1.

Table 7

Assignment mechanism based on TUA assumption with 𝜖ACE(u) = 1

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$O_{t,u_i}-O_{c,u_i}$
u₁	13	12	1
u₂	11.5	10.5	1
u₃	10	9	1
u₄	12	11	1
u₅	11	10	1
u₆	15	14	1
u₇	13	12	1
u₈	9	8	1
u₉	8.5	7.5	1
u₁₀	12	11	1
mean	11.5	10.5	1

Tables 8 and 9 are constructed based on M-TUA assumption.

Table 8

Assignment mechanism based on M-TUA assumption with 𝜖ACE(u) = 1

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$\epsilon(u)=O_{t,u_i}-O_{c,u_i}$	$\Delta(\lambda_{d,u_i})=\lambda_{t,u_i}-\lambda_{c,u_i}$
u₁	13	11.8	1.2	0.2
u₂	10.7	9.5	1.2	0.2
u₃	9.2	8	1.2	0.2
u₄	11.2	10	1.2	0.2
u₅	11	9.9	1.1	0.1
u₆	15	13.9	1.1	0.1
u₇	10.3	9.5	0.8	−0.2
u₈	9	8.2	0.8	−0.2
u₉	10.7	10	0.7	−0.3
u₁₀	9.7	9	0.7	−0.3
mean	10.98	9.98	1	0

Table 9

Assignment mechanism based on M-TUA assumption with 𝜖ACE(u) = 1

Subject	$O_{t,u_i}$	$O_{c,u_i}$	$\epsilon(u)=O_{t,u_i}-O_{c,u_i}$	$\Delta(\lambda_{d,u_i})=\lambda_{t,u_i}-\lambda_{c,u_i}$
u₁	13	10.98	2.02	1.02
u₂	11.53	9.5	2.03	1.03
u₃	10.04	8	2.04	1.04
u₄	12.04	10	2.04	1.04
u₅	11	9	2	1.00
u₆	15	15	0	−1.00
u₇	9.48	9.5	−0.02	−1.02
u₈	9	9.03	−0.03	−1.03
u₉	9.96	10	−0.04	−1.04
u₁₀	8.96	9	−0.04	−1.04
mean	11.001	10.001	1	0

As can be seen from Table 7, the data only follows one of two situations: either $O_{t,u_i} > O_{c,u_i}$ for every unit (i.e., 𝜖ACE(u) > 0), or $O_{t,u_i} < O_{c,u_i}$ for every unit (i.e., 𝜖ACE(u) < 0). However, this strong assumption, which forces all subjects to have the same 𝜖(u), is often violated in the real world. M-TUA alleviates this problem and brings the assumption more in line with the complex situations in real data (note that the values of $\Delta(\lambda_{d,u_i})$ in Tables 8 and 9 are not unique). As shown in Tables 8 and 9, under the M-TUA assumption (i.e., the RCEs cancel over the dataset), the data can be more in line with the assignment mechanism while the ACE value remains unchanged, so the data is not restricted to either of the two situations above. For example, according to (10), each $\epsilon(u_i)$ differs from 𝜖ACE(u) by its RCE $\Delta(\lambda_{d,u_i})$. Averaging over all units then shows that, since the average of $\epsilon(u_i)$ equals 𝜖ACE(u), only $\sum_{i=1}^{q}\Delta(\lambda_{d,u_i}) = 0$ needs to be satisfied in (32). There are countless assignments that satisfy this condition. Example 9 shows that the data constructed based on the M-TUA assumption allows for differences between units (e.g., in Table 9, 𝜖(u) < 0 for u7 to u10, 𝜖(u) > 0 for u1 to u5, and 𝜖(u6) = 0), while ensuring that 𝜖ACE(u) is constant (e.g., 𝜖ACE(u) = 1), which is more in line with the diversity of experimental samples in real tasks. However, it is not sufficient to require only that this cancellation holds, because it does not guarantee that the data keeps good dispersion. Therefore, it is indispensable to introduce variance as a metric to constrain the data so that the data constructed based on M-TUA maintains good dispersion. The reason is that the larger the population and the smaller the variance, the closer the estimated ACE is to the true ACE, regardless of which units are randomly assigned to each treatment. As mentioned above, for a randomized experiment, the TUA implies that the variance is constant for all treatments, which means that constant variance is a necessary condition for TUA, while M-TUA only requires a small variance (e.g., the variance of 𝜖(u) in Table 8 is less than 0.5, and the variance of 𝜖(u) in Table 9 is close to 1).
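
The properties claimed for Tables 8 and 9 can be checked directly from their first two columns. The sketch below is illustrative only: the helper name `summarize` is hypothetical, and the population variance is used as one possible reading of "variance", since the text does not say which estimator is meant.

```python
from statistics import mean, pvariance

# (O_t, O_c) pairs for u1..u10, copied from Tables 8 and 9.
table8 = [(13, 11.8), (10.7, 9.5), (9.2, 8), (11.2, 10), (11, 9.9),
          (15, 13.9), (10.3, 9.5), (9, 8.2), (10.7, 10), (9.7, 9)]
table9 = [(13, 10.98), (11.53, 9.5), (10.04, 8), (12.04, 10), (11, 9),
          (15, 15), (9.48, 9.5), (9, 9.03), (9.96, 10), (8.96, 9)]


def summarize(rows, ace=1.0):
    """Return the mean ICE, the mean RCE, and the population variance of the RCE."""
    ice = [o_t - o_c for o_t, o_c in rows]  # individual causal effects
    rce = [e - ace for e in ice]            # residual causal effects
    return mean(ice), mean(rce), pvariance(rce)


for name, rows in (("Table 8", table8), ("Table 9", table9)):
    mean_ice, mean_rce, var_rce = summarize(rows)
    print(name, round(mean_ice, 6), round(mean_rce, 6), round(var_rce, 4))
```

Both tables keep the average effect at 1 with residuals that cancel, yet their dispersions differ by more than an order of magnitude, which is why the text treats variance as an additional constraint on data constructed under M-TUA.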

Limitations

Although M-TUA weakens TUA to a certain extent and expands the scope of use of the original TUA, M-TUA itself is based on some assumptions and achieves equivalence with TUA only in the case of large samples, i.e., as q approaches ∞. Therefore, M-TUA still has the following limitations.

Dimensionality limitation of vector space. We take the 2-dimensional Gaussian distribution as an example. Based on Example 5, we analyze the equivalent form of TUA in 2-dimensional vector space. The vectorization operation in 2-dimensional space can easily be extended to 3-dimensional space. However, the equivalent form of the TUA for data in high-dimensional space has not been rigorously established.

As shown in Fig. 4d, M-TUA implies the premise that the positive and negative effects balance each other, where q+ + q− = q. A large enough sample size is required to ensure that this premise holds with high probability, because the effect of any Δ(λ) may be positive or negative (this is similar to the classical coin toss experiment: when the number of tosses is sufficient, the numbers of heads and tails are basically equal).

Decay rate. The attenuation parameter in Theorem 1 ensures that (26) eventually converges to 0 with exponential decay. The purpose of choosing exponential decay is to make Δ+(λ) or Δ−(λ) converge quickly, so that as the amount of sample data increases, their impact on the data becomes as small as possible and eventually reaches a negligible level.

Ignorability. Since M-TUA is a constraint imposed on the task of making causal inferences in the POF, ignorability (i.e., the treatment assignment is independent of the potential outcomes) still needs to hold. In addition, we argue that estimating the variance of the data is still necessary (e.g., Example 9), because the larger the population and the smaller the variance, the closer the estimated ACE is to the true ACE, regardless of which units are randomly assigned to treatment.

Interpretability

Since TUA cannot be tested or verified on observed data, it limits the use of many models (e.g., the model of (7)) [36]. Therefore, it is necessary to obtain a milder and interpretable assumption. In general, M-TUA offers the following advantages in terms of interpretability.

Based on the idea of DVE, we establish the relationship between TUA and RCE and try to provide some reasonable explanations for λ.

Through vectorization operations, we endow λ with the ability to describe positive and negative effects on data, and we theoretically prove the rationality of M-TUA on large datasets.

M-TUA not only weakens the original TUA assumption but also provides a geometric description of TUA. In particular, M-TUA has an explicit mathematical expression that represents the meaning of the original TUA assumption at an abstract level through a set of interpretable parameters.

Conclusion and future work

In this paper, we first use an example to illustrate the underlying problems of using the functional model to estimate the probability solution of counterfactual queries. We analyze the inference mechanism of the functional model and point out that its conclusions are ambiguous when the unique probability solution it outputs is 0. In other words, a probability solution of 0 obtained from the functional model does not mean that the estimated event will not occur. Second, for the TUA assumption commonly used in counterfactual models, we provide an equivalent description of TUA in low-dimensional space. We weaken TUA by vectorizing it and obtain a milder assumption, M-TUA. In addition, we give a theoretical proof and an exhaustive analysis of the rationality and limitations of M-TUA. As pointed out earlier, in M-TUA the constraints on a unit are related to the dataset and RCE, rather than being mandatory constraints on each unit. We argue that this is necessary, especially in the case of big data. Mild versions of assumptions (not just M-TUA) can be viewed as an abstraction from the micro world to the macro world [46]. An intuitive example: to measure the water temperature of a swimming pool, it is impossible to measure every drop of water in the pool. However, we do not regard the conclusion of this paper as the final form of M-TUA. Therefore, we will focus on the following points in our future work.

Practicality

Causal science has shown great vitality in AI and public health [47]. However, a large number of tasks can only be carried out when strong assumptions are satisfied, and the assumptions used are often not differentiated by task. Therefore, whether versions of such assumptions, including M-TUA, can be further developed for different AI task scenarios is a topic worthy of further consideration.

Challenges posed by high-dimensional data

As a theoretical exploration of weakening TUA, M-TUA presents the equivalent form of TUA in vector space through vectorization and gives it a certain degree of interpretability. However, with the explosion of data, AI practitioners are confronted with data that are very large in both volume and dimensionality. Although our theorem shows that M-TUA is applicable in the case of big data, high-dimensional data brings new challenges. Therefore, developing assumptions that build on M-TUA, come with theoretical guarantees, and apply to high-dimensional data is also a focus of our future work.
