
Nonparametric Causal Structure Learning in High Dimensions.

Shubhadeep Chakraborty1, Ali Shojaie1.   

Abstract

The PC and FCI algorithms are popular constraint-based methods for learning the structure of directed acyclic graphs (DAGs) in the absence and presence of latent and selection variables, respectively. These algorithms (and their order-independent variants, PC-stable and FCI-stable) have been shown to be consistent for learning sparse high-dimensional DAGs based on partial correlations. However, inferring conditional independences from partial correlations is valid only if the data are jointly Gaussian or generated from a linear structural equation model, an assumption that may be violated in many applications. To broaden the scope of high-dimensional causal structure learning, we propose nonparametric variants of the PC-stable and FCI-stable algorithms that employ the conditional distance covariance (CdCov) to test for conditional independence relationships. As the key theoretical contribution, we prove that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to general distributions over DAGs when we implement CdCov-based nonparametric tests for conditional independence. Numerical studies demonstrate that our proposed algorithms perform nearly as well as PC-stable and FCI-stable for Gaussian distributions, and offer advantages in non-Gaussian graphical models.


Keywords:  FCI algorithm; PC algorithm; causal structure learning; consistency; high dimensionality; nonparametric testing

Year:  2022        PMID: 35327862      PMCID: PMC8947566          DOI: 10.3390/e24030351

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables [1,2,3]. The PC algorithm [3] is the most popular constraint-based method for learning DAGs from observational data under the assumption of causal sufficiency, i.e., when there are no unmeasured common causes and no selection variables. It first estimates the skeleton of a DAG by recursively performing a sequence of conditional independence tests, and then uses the information from the conditional independence relations to partially orient the edges, resulting in a completed partially directed acyclic graph (CPDAG). In Section 2, we provide a review of these and other notions commonly used in the graphical modeling literature that are relevant to our work. In addition, we refer to estimating the CPDAG as structure learning of the underlying DAG throughout the rest of the paper. Observational studies often involve latent and selection variables, which complicate the causal structure learning problem. Ignoring such unmeasured variables can make the causal inference based on the PC algorithm erroneous; see, e.g., Section 1.2 in [4] for some illustrations. The Fast Causal Inference (FCI) algorithm and its variants [3,4,5,6] utilize similar strategies as the PC algorithm to learn the DAG structure in the presence of latent and selection variables. Both PC and FCI algorithms adopt a hierarchical search strategy—they recursively perform conditional independence tests given subsets of increasingly larger cardinalities in some appropriate search pool. The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation. To overcome this limitation, Ref. 
[7] proposed two variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms, which resolve the order dependence at different stages of the algorithms. In general, testing for conditional independence is a problem of central importance in causal structure learning. The literature on the PC and FCI algorithms predominantly uses partial correlations to infer conditional independence relations. It is well-known that the characterization of conditional independence by partial correlations, or, in other words, the equivalence between conditional independence and zero partial correlations, only holds for multivariate normal random variables. Therefore, the high-dimensional consistency results for the PC and FCI algorithms [4,8] are limited to Gaussian graphical models, where the nodes correspond to random variables with a joint Gaussian distribution. Although the Gaussian graphical model is the standard parametric model for continuous data, it may not hold in many real data applications. While this limitation can be somewhat relaxed by considering linear structural equation models (SEMs) with general noise distributions [9], linear SEMs and joint Gaussianity are essentially equivalent [10]. Moreover, neither approach is appropriate when the observations are categorical, discrete, or are supported on a subset of the real line. In Section 4.3, for example, we present a real application where all the observed variables are categorical, and therefore far from being Gaussian. As an improvement, ref. [11] used rank-based partial correlations to test for conditional independence relations, showing that the high-dimensional consistency of the PC algorithm holds for a broader class of Gaussian copula models. 
Some nonparametric versions of the PC algorithm have also been proposed in the literature via kernel-based tests for conditional independence [12,13]; however, they lack theoretical justification of correctness and have not been studied in high dimensions. This work aims to broaden the applicability of the PC-stable and FCI-stable algorithms to general distributions by employing a nonparametric test for conditional independence relationships. To this end, we utilize recent developments on dependence metrics that quantify nonlinear and non-monotone dependence between multivariate random variables. More specifically, our work builds on the idea of distance covariance (dCov) proposed by [14] and its extension to conditional distance covariance (CdCov) by [15] as a nonparametric measure of nonlinear and non-monotone conditional dependence between two random vectors of arbitrary dimensions given a third. Utilizing this flexibility, we use CdCov to test for conditional independence relationships in the sample versions of the PC-stable and FCI-stable algorithms. The resulting algorithms, termed nonPC and nonFCI for distinction, facilitate causal structure learning from general distributions over DAGs and are shown to be consistent in sparse high-dimensional settings. We establish the consistency of the proposed algorithms under mild moment and tail conditions on the variables, without requiring strict distributional assumptions. To our knowledge, the proposed generalizations of the PC/PC-stable and FCI/FCI-stable algorithms provide the first general nonparametric framework for causal structure learning with theoretical guarantees in high dimensions. 
The rest of the paper is organized as follows: In Section 2, we review the relevant background, including preliminaries on graphical modeling (Section 2.1), an outline of the PC-stable and FCI-stable algorithms (Section 2.2) and a brief overview of dCov and CdCov (Section 2.3). The nonparametric version of the PC-stable algorithm is presented in Section 3.1. As a key contribution of the paper, we establish that the algorithm consistently estimates the skeleton and the equivalence class of the underlying sparse high-dimensional DAG in a general nonparametric framework. We then present the nonparametric version of the FCI-stable algorithm in Section 3.2 and establish its consistency in sparse high-dimensional settings. As the FCI involves the adjacency search of the PC algorithm, any improvement on the PC/PC-stable directly carries over to the FCI/FCI-stable as well. In Section 4, we compare the performances of our algorithms with the PC-stable and FCI-stable using simulated datasets (both Gaussian and non-Gaussian examples), as well as a real dataset. These numerical studies clearly demonstrate that the nonPC and nonFCI algorithms are comparable to PC-stable and FCI-stable for Gaussian data and offer improvements for non-Gaussian data.

2. Background

2.1. Preliminaries on Graphical Modeling

We start by introducing some necessary terminology and background information. Our notation and terminology follow standard conventions in graphical modeling (see, e.g., [3]). A graph G = (V, E) consists of a vertex set V and an edge set E. In a graphical model, the vertices or nodes are associated with random variables X_1, …, X_p. Throughout, we index the nodes by the corresponding random variables. We allow the edge set E of the graph to contain (a subset of) the following six types of edges: → (directed), ↔ (bidirected), − (undirected), ∘−∘ (nondirected), ∘− (partially undirected) and ∘→ (partially directed). The endpoints of an edge are called marks, which can be tails, arrowheads or circles. A "∘" at the end of an edge indicates that it is not known whether an arrowhead should occur at that place. We use the symbol "★" to denote an arbitrary edge mark; for example, ★→ represents an edge of the type →, ↔ or ∘→ in the graph. A mixed graph is a graph containing directed, bidirected and undirected edges. A graph containing only directed edges is called a directed graph, one containing only undirected edges is called an undirected graph, and one containing directed and undirected edges is called a partially directed graph. The adjacency set of a vertex a in the graph G, denoted adj(a, G), is the set of all vertices in V that are adjacent to a, or, in other words, are connected to a by an edge. The degree of a vertex a is defined as the number of vertices adjacent to it. A graph is complete if all pairs of vertices in the graph are adjacent. A vertex a is called a parent of b if a → b, a child of b if b → a, and a neighbor of b if a − b. The skeleton of a graph G is the undirected graph obtained by replacing all the edges of G by undirected edges, in other words, ignoring all the edge orientations. Three vertices (a, b, c) are called an unshielded triple if a and b are adjacent, b and c are adjacent, but a and c are not adjacent. A path is a sequence of distinct adjacent vertices. 
A node a is an ancestor of its descendant b if the graph contains a directed path a → ⋯ → b. A non-endpoint vertex b on a path is called a collider on the path if both the edges preceding and succeeding it have an arrowhead at b, or, in other words, the path contains a ★→ b ←★ c. An unshielded triple (a, b, c) is called a v-structure if b is a collider on the path ⟨a, b, c⟩. A cycle occurs in a graph when there is a path from a to b, and a and b are adjacent. A directed path from a to b forms a directed cycle together with the edge b → a, and it forms an almost directed cycle together with the edge b ↔ a. Three vertices that form a cycle are called a triangle. A directed acyclic graph (DAG) is a directed graph that does not contain any cycle. A DAG entails conditional independence relationships via a graphical criterion called d-separation (Section 1.2.3 in [16]). Two vertices a and b that are not adjacent in a DAG are d-separated in the DAG by a subset S of the remaining vertices. A probability distribution P is said to be faithful with respect to the DAG G if the conditional independence relationships in P can be inferred from G using d-separation and vice versa; in other words, X_a ⟂ X_b | X_S if and only if a and b are d-separated in G by S. A graph that is both (partially) directed and acyclic is called a partially directed acyclic graph (PDAG). DAGs that encode the same set of conditional independence relations form a Markov equivalence class [17]. Two DAGs belong to the same Markov equivalence class if and only if they have the same skeleton and the same v-structures. A Markov equivalence class of DAGs can be uniquely represented by a completed partially directed acyclic graph (CPDAG), which is a PDAG that satisfies the following: (i) a → b in the CPDAG if a → b in every DAG in the Markov equivalence class, and (ii) a − b in the CPDAG if the Markov equivalence class contains a DAG in which a → b as well as a DAG in which a ← b.
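The d-separation criterion above can be checked mechanically via the classical moralization construction (restrict to the ancestral set, marry co-parents, drop edge directions, remove the conditioning set, and test connectivity). The following is a minimal pure-Python sketch of that check; the dictionary-based DAG representation is our own illustration, not a construct from the paper.

```python
def d_separated(parents, x, y, z):
    """Moralization test for d-separation in a DAG.

    `parents` maps each node to the set of its parents. x and y are
    d-separated by the set z iff they are disconnected in the moralized
    ancestral subgraph of {x, y} ∪ z after removing z.
    """
    z = set(z)
    # 1. ancestral set of {x, y} ∪ z
    anc, stack = set(), list({x, y} | z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. moralize: undirected parent-child edges, plus "marry" co-parents
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in parents.get(v, ()) if p in anc]
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                nbrs[ps[i]].add(ps[j])
                nbrs[ps[j]].add(ps[i])
    # 3. remove z and test whether x still reaches y
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v in seen or v in z:
            continue
        seen.add(v)
        stack.extend(nbrs[v] - z)
    return True
```

For the collider a → c ← b, the function reports that a and b are d-separated marginally but not once we condition on c, matching the v-structure behavior described above.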

2.2. The PC-Stable and FCI-Stable Algorithms

In this section, we provide an outline of the PC/PC-stable and FCI/FCI-stable algorithms. Estimation of the CPDAG by the PC algorithm involves two steps: (1) estimation of the skeleton and separating sets (also called the adjacency search step); and (2) partial orientation of edges; see Algorithms 1 and 2 in [8] for details. Intuitively, the PC algorithm works as follows. In the first step (the adjacency search step), the algorithm starts with a complete undirected graph. Then, for conditioning sets of increasing cardinality k = 0, 1, 2, …, the algorithm removes the edge between a and b if X_a and X_b are conditionally independent given a subset S of size k chosen among the current neighbors of nodes a and b. This process continues up to the order k = q, where q is the maximum degree of the underlying DAG. By searching over the neighboring nodes, the algorithm is adaptive and can efficiently infer sparse high-dimensional DAGs, where the sparsity is characterized by the maximum node degree, q. In the presence of latent and selection variables, one needs a generalization of a DAG, called a maximal ancestral graph (MAG). A mixed graph is called an ancestral graph if it contains no directed or almost directed cycles and no subgraph of the type a ★→ b − c. DAGs form a subset of ancestral graphs. A MAG is an ancestral graph in which every missing edge corresponds to a conditional independence relationship via the m-separation criterion [18], a generalization of the notion of d-separation. Multiple MAGs may represent the same set of conditional independence relations. Such MAGs form a Markov equivalence class, which can be represented by a partial ancestral graph (PAG) [19]; see [18] for additional details. Under the faithfulness assumption, the Markov equivalence class of a DAG with latent and selection variables can be learned using the FCI algorithm (e.g., Algorithm 3.1 in [4]), which is a modification of the PC algorithm. 
The FCI algorithm first employs the adjacency search of the PC algorithm, and then performs additional conditional independence queries to account for the presence of latent variables, followed by partial orientation of the edges, resulting in an estimated PAG. The FCI algorithm adopts the same hierarchical search strategy as the PC algorithm: it starts with a complete undirected graph and recursively removes edges via conditional independence queries given subsets of increasingly larger cardinalities in some appropriate search pool. The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation, as it shares the adjacency search step of the PC algorithm as its first step. To overcome this limitation, ref. [7] proposed variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms, which resolve the order dependence at different stages of the algorithms. The basic difference between the PC algorithm and the PC-stable algorithm is that, in the adjacency search step, the latter computes and stores the adjacency sets of all the variables before each new cardinality k of the conditioning sets. These stored adjacency sets are then used to search for conditioning sets of the given size k. As a consequence, the removal of an edge no longer affects which conditional independence relations need to be checked for other pairs of variables at this given size of the conditioning sets. We refer the reader to Appendix A, where we provide in full detail the pseudocodes of the oracle versions of the PC-stable and FCI-stable algorithms. In the oracle versions of the algorithms, it is assumed that perfect knowledge is available about all the necessary conditional independence relations. As such, conditional independence relations are not estimated from data. 
Of course, this perfect knowledge is not available in practice. Sample versions of the PC-stable and FCI-stable algorithms can be obtained by replacing the conditional independence queries by a suitable test for conditional independence at some pre-specified level. For example, if the variables are jointly Gaussian, one can test for zero partial correlations (see, e.g., [8]). The next subsection is devoted to discussions on nonparametric tests for independence and conditional independence.
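The adjacency search with the "stable" modification can be sketched compactly. The following illustrative Python sketch (not the paper's pseudocode; the oracle interface `ci_test(a, b, S)` is our own simplification, and only one endpoint's neighborhood is searched for brevity) shows where the order-independence comes from: the adjacency sets are frozen before each level k, so removals within a level cannot change which tests are performed at that level.

```python
from itertools import combinations

def pc_stable_skeleton(nodes, ci_test, max_order):
    """Adjacency search (step 1) of a PC-stable-style algorithm.

    ci_test(a, b, S) -> bool is a pluggable conditional-independence
    oracle or statistical test. Returns the estimated adjacency sets
    and the separating sets found for each removed edge.
    """
    adj = {v: set(nodes) - {v} for v in nodes}  # start from the complete graph
    sepset = {}
    for k in range(max_order + 1):
        # PC-stable: freeze the adjacency sets for the whole level k
        frozen = {v: frozenset(adj[v]) for v in nodes}
        for a in nodes:
            for b in sorted(frozen[a]):
                if b not in adj[a]:
                    continue  # edge already removed earlier at this level
                # conditioning sets of size k among a's frozen neighbors
                for S in combinations(sorted(frozen[a] - {b}), k):
                    if ci_test(a, b, set(S)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(S)
                        break
    return adj, sepset
```

Plugging in an oracle for the chain a → b → c (where the only conditional independence is X_a ⟂ X_c | X_b) recovers the skeleton a − b − c with separating set {b} for the removed pair.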

2.3. Distance Covariance and Conditional Distance Covariance

We start by describing the notation used throughout the paper. We denote by ‖x‖ the Euclidean norm of x, suppressing the dimension when it is clear from the context. We use X ⟂ Y to denote the independence of X and Y and use E_U to denote expectation with respect to the probability distribution of the random variable U. For any set S, we denote its cardinality by |S|. We use the usual asymptotic notation, 'O' and 'o', as well as their probabilistic counterparts, O_p and o_p, which denote stochastic boundedness and convergence in probability, respectively. For two sequences of real numbers {a_n} and {b_n}, a_n ≍ b_n if and only if a_n = O(b_n) and b_n = O(a_n) as n → ∞. We use the symbol a_n ≲ b_n to indicate that a_n ≤ C b_n for some constant C > 0. For a matrix A = (a_kl), we denote its determinant by det(A) and define its U-centered version Ã = (ã_kl) by ã_kl = a_kl − (1/(n−2)) Σ_i a_il − (1/(n−2)) Σ_j a_kj + (1/((n−1)(n−2))) Σ_{i,j} a_ij for k ≠ l, and ã_kk = 0, for k, l = 1, …, n (Equation (1)). We denote the indicator function of any set A by I(A). Finally, we denote the integer part of x by ⌊x⌋. Ref. [14], in their seminal paper, introduced the notion of distance covariance (dCov, henceforth) to quantify nonlinear and non-monotone dependence between two random vectors of arbitrary dimensions. Consider two random vectors X ∈ R^p and Y ∈ R^q with E‖X‖ < ∞ and E‖Y‖ < ∞. The distance covariance between X and Y is defined as the positive square root of dCov²(X, Y) = ∫ |f_{X,Y}(t, s) − f_X(t) f_Y(s)|² / (c_p c_q ‖t‖^{1+p} ‖s‖^{1+q}) dt ds, where f_X, f_Y and f_{X,Y} are the individual and joint characteristic functions of X and Y, respectively, and c_d = π^{(1+d)/2} / Γ((1+d)/2) is a constant, with Γ(·) being the complete gamma function. The key feature of dCov is that it completely characterizes the independence between two random vectors, or in other words, dCov²(X, Y) = 0 if and only if X ⟂ Y. According to Remark 3 in [14], dCov can be equivalently expressed as dCov²(X, Y) = E‖X − X′‖‖Y − Y′‖ + E‖X − X′‖ E‖Y − Y′‖ − 2 E‖X − X′‖‖Y − Y″‖, where (X′, Y′) and (X″, Y″) are i.i.d. copies of (X, Y). This alternative expression comes in handy for constructing V- or U-statistic type estimators of the quantity. For an observed random sample {(X_i, Y_i)}_{i=1}^n from the joint distribution of X and Y, define the distance matrices A = (a_kl) and B = (b_kl), where a_kl = ‖X_k − X_l‖ and b_kl = ‖Y_k − Y_l‖. Following the U-centering idea in [20], an unbiased U-statistic type estimator of dCov²(X, Y) can be expressed as (1/(n(n−3))) Σ_{k≠l} ã_kl b̃_kl, where Ã = (ã_kl) and B̃ = (b̃_kl) are the U-centered versions of the matrices A and B, respectively, as defined in (1). Ref. 
[15] generalized the notion of dCov and introduced the conditional distance covariance (CdCov, henceforth) as a measure of conditional dependence between two random vectors of arbitrary dimensions given a third. CdCov essentially replaces the characteristic functions used in the definition of dCov by conditional characteristic functions. Consider a third random vector Z ∈ R^r. Denote by f_{X,Y|Z} the conditional joint characteristic function of X and Y given Z, and by f_{X|Z} and f_{Y|Z} the conditional marginal characteristic functions of X and Y given Z, respectively. Then, the CdCov between X and Y given Z is defined as the positive square root of CdCov²(X, Y | Z) = ∫ |f_{X,Y|Z}(t, s) − f_{X|Z}(t) f_{Y|Z}(s)|² / (c_p c_q ‖t‖^{1+p} ‖s‖^{1+q}) dt ds. The key feature of CdCov is that CdCov²(X, Y | Z) = 0 almost surely if and only if X ⟂ Y | Z, which is quite straightforward to see from the definition. Similar to dCov, an equivalent alternative expression can be established for CdCov that avoids complicated integrations involving conditional characteristic functions. Let {(X_i, Y_i, Z_i)}_{i=1}^n be an i.i.d. sample from the joint distribution of (X, Y, Z). One defines a function d_ijkl from the pairwise distances among the sample points (X_i, Y_i), …, (X_l, Y_l); since d_ijkl is not symmetric with respect to the indices (i, j, k, l), its symmetrized version is used instead. Lemma 1 in [15] establishes an equivalent representation of CdCov²(X, Y | Z) in terms of this symmetrized quantity (Equation (3)). For z ∈ R^r, let K_H(z) = |H|^{−1/2} K(H^{−1/2} z) be a kernel function, where H is the diagonal matrix determined by a bandwidth parameter h. K is typically taken to be the Gaussian kernel. Let ω_i = K_H(Z_i − z) for i = 1, …, n. Then, by virtue of the equivalent representation of CdCov in (3), a V-statistic type estimator of CdCov²(X, Y | Z = z) can be constructed as a kernel-weighted average of the symmetrized d_ijkl (Equation (4)). Under certain regularity conditions, Theorem 4 in [15] shows that, conditioned on Z, the estimator converges in probability to CdCov²(X, Y | Z) as n → ∞.
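The two estimators above can be sketched numerically. The U-centered dCov estimator below follows the unbiased form of [20]; the CdCov sketch is a kernel-weighted plug-in (the squared distance covariance of the locally reweighted empirical measure), which is an illustrative stand-in for, not a verbatim transcription of, the d_ijkl-based V-statistic in [15].

```python
import numpy as np

def _pairwise_dist(X):
    """Euclidean distance matrix of the rows of an (n, d) array."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def dcov2_unbiased(X, Y):
    """U-statistic type (unbiased) estimator of squared distance
    covariance via U-centered distance matrices (requires n >= 4)."""
    n = X.shape[0]
    def u_center(D):
        row = D.sum(axis=1, keepdims=True) / (n - 2)
        col = D.sum(axis=0, keepdims=True) / (n - 2)
        grand = D.sum() / ((n - 1) * (n - 2))
        U = D - row - col + grand
        np.fill_diagonal(U, 0.0)   # diagonal is zero by definition
        return U
    A = u_center(_pairwise_dist(X))
    B = u_center(_pairwise_dist(Y))
    return (A * B).sum() / (n * (n - 3))

def cdcov2_at(X, Y, Z, z, h):
    """Kernel-weighted plug-in sketch of CdCov^2(X, Y | Z = z):
    squared distance covariance under local Gaussian-kernel weights
    w_i proportional to exp(-||Z_i - z||^2 / (2 h^2))."""
    w = np.exp(-0.5 * np.sum(((Z - z) / h) ** 2, axis=1))
    w = w / w.sum()                  # normalized local weights
    a, b = _pairwise_dist(X), _pairwise_dist(Y)
    S1 = w @ (a * b) @ w             # weighted E ||X-X'|| ||Y-Y'||
    S2 = (w @ a @ w) * (w @ b @ w)   # product of weighted means
    S3 = ((a @ w) * (b @ w)) @ w     # weighted cross term
    return S1 + S2 - 2.0 * S3
```

Both functions are symmetric in X and Y, and the local-weight construction keeps `cdcov2_at` nonnegative up to floating-point error, mirroring the population quantity.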

3. Methodology and Theory

3.1. The Nonparametric PC Algorithm in High Dimensions

To obtain a measure of conditional dependence between X and Y given Z that is free of Z, we define T(X, Y | Z) as the expectation, over the distribution of Z, of CdCov²(X, Y | Z). Clearly, T(X, Y | Z) = 0 if and only if X ⟂ Y | Z. Consider the plug-in estimate T_n of T(X, Y | Z), obtained by averaging the V-statistic type estimator of CdCov²(X, Y | Z = Z_j) over the sample points Z_j. We reject H0: X ⟂ Y | Z versus H1: X ⟂̸ Y | Z at level α if T_n exceeds a suitably chosen threshold. In Appendix A, we present a local bootstrap procedure for choosing the threshold in practice, which is also used in our numerical studies. Henceforth, we often suppress the arguments and simply write T and T_n, whenever there is no confusion. In view of the complete characterization of conditional independence by T, we propose testing for conditional independence relations nonparametrically in the sample version of the PC-stable algorithm based on T_n, rather than partial correlations. We coin the resulting algorithm the 'nonPC' algorithm, to emphasize that it is a nonparametric generalization of the parametric PC-stable algorithm. The oracle version of the first step of nonPC, or the skeleton estimation step, is exactly the same as that of the PC-stable algorithm (Algorithm A1 in Appendix A). The second step, which extends the skeleton estimated in the first step to a CPDAG (Algorithm A2 in Appendix A), comprises purely deterministic rules for edge orientations, and is likewise identical for nonPC and PC-stable. The only difference lies in the implementation of the tests for conditional independence relationships in the sample versions of the first step. Specifically, we replace all the conditional independence queries in the first step by tests based on T_n: at a pre-specified significance level α, we infer that X_a ⟂ X_b | X_S when T_n does not exceed the threshold. When the conditioning set S is empty, the test reduces to an unconditional test of independence based on dCov, and the critical value in this case is obtained by a bootstrap procedure (see, e.g., Section 4 in [22]). 
Given that the equivalence between conditional independence and zero partial correlations only holds for multivariate normal random variables, our generalization broadens the scope of applicability of causal structure learning by the PC/PC-stable algorithm to general distributions over DAGs. This nonparametric approach is thus a natural extension of Gaussian and Gaussian copula models. It enables capturing nonlinear and non-monotone conditional dependence relationships among the variables, which partial correlations fail to detect. Next, we establish theoretical guarantees on the correctness of the nonPC algorithm in learning the true underlying causal structure in sparse high-dimensional settings. Our consistency results only require mild moment and tail conditions on the set of variables, without making any strict distributional assumptions. Denote by m_n the maximum cardinality of the conditioning sets considered in the adjacency search step of the PC-stable algorithm. Clearly, m_n ≤ q, where q is the maximum degree of the DAG. For a fixed pair of nodes (a, b), the conditioning sets considered in the adjacency search step are subsets of the adjacency sets of a and b of cardinality at most m_n. We first establish a concentration inequality that gives the rate at which the absolute difference between the population measure and its plug-in estimate decays to zero, for any fixed pair of nodes a and b and a fixed conditioning set S. Towards that, we impose the following regularity conditions. (A1) There exists s0 > 0 such that, for 0 < s ≤ 2 s0, the moment generating functions E exp(s X_j²) are uniformly bounded over j. (A2) The kernel function is non-negative and uniformly bounded over its support. Condition (A1) imposes a sub-exponential tail bound on the squares of the random variables. This is a quite commonly used condition, for example, in the high-dimensional feature screening literature (see, for example, [23]). Condition (A2) is a mild condition on the kernel function that is satisfied by many commonly used kernels, including the Gaussian kernel. 
Under conditions (A1) and (A2), the next result (Theorem 1) shows that the plug-in estimate converges in probability to its population counterpart exponentially fast, via an explicit exponential concentration bound. The proof of Theorem 1 is long and somewhat technical; it is thus relegated to Appendix B. Theorem 1 serves as the main building block towards establishing the consistency of the nonPC algorithm in sparse high-dimensional settings. In Theorem 2, we establish a uniform bound, over all node pairs and admissible conditioning sets, on the probability of error in inferring conditional independence relationships using the CdCov-based test in the skeleton estimation step of the sample version of the nonPC algorithm. Next, we turn to proving the consistency of the nonPC algorithm in the high-dimensional setting, where the dimension p can be much larger than the sample size n, but the DAG is considered to be sparse. We impose the following regularity conditions, which are similar to the assumptions imposed in Section 3.1 of [8] in order to prove the consistency of the PC algorithm for Gaussian graphical models. We let the number of variables p grow with the sample size n and consider the corresponding sequences of DAGs and distributions. (A3) The dimension grows at a rate such that the right-hand side of (7) tends to zero as n → ∞; in particular, this is satisfied when p grows at any polynomial rate in n. (A4) The maximum degree q of the DAG grows at the rate O(n^(1−b)), where 0 < b ≤ 1. (A5) The distribution is faithful to the DAG for all n; in other words, for any pair of nodes a, b and any conditioning set S, X_a ⟂ X_b | X_S if and only if a and b are d-separated by S. Moreover, the nonzero values of the population dependence measure are uniformly bounded both from above and below: formally, their infimum over dependent triples (a, b, S) is bounded below by a positive sequence c_n, and their supremum is bounded above by a positive constant M. Condition (A3) allows the dimension to grow at any arbitrary polynomial rate of the sample size. Condition (A4) is a sparsity assumption on the underlying true DAG, allowing the maximum degree of the DAG to also grow, but at a slower rate than n. Since m_n ≤ q, the same rate applies to m_n. 
Finally, Condition (A5) is the strong faithfulness assumption (Definition 1.3 in [24]) and is similar to condition (A4) in [8]. It essentially requires the population dependence measure to be bounded away from zero whenever the vertices a and b are not d-separated by S. It is worth noting that the faithfulness assumption alone is not enough to prove the consistency of the PC/PC-stable/nonPC algorithms in high-dimensional settings, and the more stringent strong faithfulness condition is required. The next theorem establishes that the nonPC algorithm consistently estimates the skeleton of a sparse high-dimensional DAG, thereby providing the necessary theoretical guarantees for our proposed methodology. It is worth noting that, in the sample versions of the PC-stable and hence the nonPC algorithm, all the inference is done during the skeleton estimation step. The second step, which involves appropriately orienting the edges of the estimated skeleton, is purely deterministic (see Sections 4.2 and 4.3 in [7]). Therefore, to prove the consistency of the nonPC algorithm in estimating the equivalence class of the underlying true DAG, it is enough to prove the consistency of the estimated skeleton. Theorem 3, whose detailed proof we include in Appendix B, shows that under Conditions (A1)–(A5), and with a suitably chosen threshold for the tests (specified in the proof), the estimated skeleton equals the true skeleton with probability tending to one.
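A sample-level conditional independence test of the kind used here can be sketched end to end: aggregate the local CdCov-type statistic over the observed Z values, and calibrate the threshold by a local bootstrap that resamples X given Z, so that X ⟂ Y | Z holds by construction in the resamples. The sketch below is self-contained and illustrative only; the weighted statistic and the bootstrap resampling law are our simplified stand-ins for the paper's Appendix A procedure.

```python
import numpy as np

def _cdcov2_profile(X, Y, Z, h):
    """Average of a kernel-weighted CdCov^2(X, Y | Z = Z_j) sketch over
    the observed Z_j: a sample analogue of E_Z[CdCov^2(X, Y | Z)]."""
    n = Z.shape[0]
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    total = 0.0
    for j in range(n):
        w = np.exp(-0.5 * np.sum(((Z - Z[j]) / h) ** 2, axis=1))
        w = w / w.sum()
        S1 = w @ (a * b) @ w
        S2 = (w @ a @ w) * (w @ b @ w)
        S3 = ((a @ w) * (b @ w)) @ w
        total += S1 + S2 - 2.0 * S3
    return total / n

def ci_test_cdcov(X, Y, Z, h=1.0, B=199, alpha=0.05, seed=0):
    """Compare the observed statistic with the (1 - alpha) quantile of
    a local-bootstrap null in which X is resampled given Z.
    Returns True when conditional independence is NOT rejected."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    stat = _cdcov2_profile(X, Y, Z, h)
    # local-bootstrap law: P(X* = X_k | Z = Z_i) proportional to K_h(Z_k - Z_i)
    W = np.exp(-0.5 * np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2) / h ** 2)
    W = W / W.sum(axis=1, keepdims=True)
    null = np.empty(B)
    for t in range(B):
        idx = np.array([rng.choice(n, p=W[i]) for i in range(n)])
        null[t] = _cdcov2_profile(X[idx], Y, Z, h)
    return bool(stat <= np.quantile(null, 1 - alpha))
```

A skeleton-search routine can call `ci_test_cdcov` as its conditional independence query, with the bandwidth h and the number of bootstrap replicates B as tuning parameters.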

3.2. The Nonparametric FCI Algorithm in High Dimensions

The FCI is a modification of the PC algorithm that accounts for latent and selection variables. Thus, generalizations of the PC algorithm naturally extend to the FCI as well. Similar to nonPC, we propose testing for conditional independence relations nonparametrically in the sample version of the FCI-stable algorithm (Algorithm A3 in Appendix A) based on the CdCov-based measure of Section 3.1, instead of partial correlations. We coin the resulting algorithm the 'nonFCI' algorithm, to emphasize that it is a generalization of the parametric FCI-stable algorithm. Again, the oracle version of nonFCI is exactly the same as that of the FCI-stable algorithm. The difference is in the implementation of the tests for conditional independence relationships in their sample versions. This broadens the scope of the FCI algorithm in causal structure learning for observational data in the presence of latent and selection variables when Gaussianity is not a viable assumption. More specifically, it enables capturing nonlinear and non-monotone conditional dependence relationships among the variables that partial correlations would fail to detect. Equipped with the theoretical guarantees we established for the nonPC in Section 3.1, we establish below in Theorem 4 the consistency of the nonFCI algorithm for general distributions in sparse high-dimensional settings. Let G = (V, E) be a DAG with the vertex set partitioned as V = O ∪ L ∪ S, where O indexes the set of p observed variables, L denotes the set of latent variables and S stands for the set of selection variables. Let M be the unique MAG over the observed variables. We let p grow with n and consider the corresponding sequences of DAGs, MAGs and distributions, where Q is the distribution of the observed variables. We provide below the definition of possible-D-SEP sets (Definition 3.3 in [4]): a vertex v belongs to Possible-D-SEP(a, b) if v ≠ a and there is a path between a and v on which every non-endpoint vertex is either a collider or part of a triangle. To prove the consistency of the nonFCI algorithm in sparse high-dimensional settings, we impose the following regularity conditions, which are similar to the assumptions imposed in Section 4 of [4]. (C3) The distribution Q is faithful to the underlying MAG for all n. 
(C4) The maximum size of the possible-D-SEP sets used for finding the final skeleton in the FCI-stable algorithm (Algorithm A6 in Appendix A) grows at the rate O(n^(1−b)), where 0 < b ≤ 1. (C5) For any pair of observed variables a, b and any admissible conditioning set S, the nonzero values of the population dependence measure are uniformly bounded away from zero and bounded above by a positive constant, analogously to condition (A5). Theorem 4 then establishes that, if conditions (A1)–(A3) and (C3)–(C5) hold, the nonFCI algorithm, with a suitable choice of the test threshold given in the proof, consistently estimates the skeleton and the equivalence class of the true underlying MAG in sparse high-dimensional settings.

4. Numerical Studies

4.1. Performance of the NonPC Algorithm

In this subsection, we compare the performances of the nonPC and the PC-stable algorithms in finding the skeleton and the CPDAG for various simulated datasets. We simulate random DAGs in the following examples and sample from probability distributions faithful to them. Example 1 (Linear SEM). We first fix a sparsity parameter s ∈ (0, 1) and generate the adjacency matrix Λ as follows. First, initialize Λ as a zero matrix. Next, fill every entry in the lower triangle (below the diagonal) of Λ by independent realizations of Bernoulli(s) random variables, and assign each nonzero entry an independent Uniform edge weight. In this scheme, each node has the same expected degree E(m) = s(p − 1), where m, the degree of a node, follows a Binomial(p − 1, s) distribution. Using the adjacency matrix Λ, the data are then generated from the linear structural equation model (SEM) X = ΛX + ε, where the coordinates of ε are jointly independent. To obtain samples on X, we first sample ε from the three following data-generating schemes. For i = 1, …, n and j = 1, …, p: (1) Normal: generate ε_ij independently from a standard normal distribution. (2) Copula: generate ε_ij as in (1) and then monotonically transform the marginals, yielding a Gaussian copula model. (3) Mixture: generate ε_ij independently from a 50–50 mixture of a standard normal and a standard Cauchy distribution. Example 2 (Nonlinear SEM). In this example, we first generate Λ in a similar way as in Example 1 and then generate the data from a nonlinear SEM over the same DAG. With the exception of the Normal scheme (Example 1.1), the above examples are all non-Gaussian graphical models. We would thus expect the nonPC to perform better than the PC-stable in learning the unknown causal structure in these examples. For each of the four data generating methods considered above, we compare the Structural Hamming Distance (SHD) [25] between the estimated and the true skeletons of the underlying DAGs using the nonPC and PC-stable algorithms. The SHD between two undirected graphs is the number of edge additions or deletions necessary to make the two graphs match. Therefore, larger SHD values between the estimated and the true skeleton correspond to worse estimates. 
We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of our nonPC algorithm, at a fixed significance level. Table 1 presents the average SHD for the different data generating schemes over 20 simulation runs, for different choices of n, p and E(m).
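The Example 1 generator described above can be sketched directly: a strictly lower triangular weight matrix defines a DAG over the variable ordering, and samples are obtained by solving the linear SEM. The Uniform(0.1, 1) weight range below is an illustrative assumption, not a value stated in this excerpt.

```python
import numpy as np

def simulate_linear_sem(n, p, expected_degree, rng):
    """Example-1-style generator: a strictly lower triangular weight
    matrix Lambda defines a DAG on the ordering 1..p, and samples
    solve X = Lambda X + eps. The Uniform(0.1, 1) edge weights are an
    assumption chosen for illustration."""
    s = expected_degree / (p - 1)                  # so that E(m) = s (p - 1)
    mask = np.tril(rng.random((p, p)) < s, k=-1)   # Bernoulli(s) lower triangle
    Lam = mask * rng.uniform(0.1, 1.0, size=(p, p))
    eps = rng.normal(size=(n, p))                  # jointly independent noise
    # row-wise solution of x = Lam x + e, i.e. X = eps (I - Lam^T)^{-1}
    X = eps @ np.linalg.inv(np.eye(p) - Lam.T)
    return Lam, X
```

Swapping the `rng.normal` noise for a Gaussian-copula or normal/Cauchy-mixture sampler reproduces the Copula and Mixture schemes on the same graph.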
Table 1

Comparison of the average structural Hamming distances (SHD) of nonPC and PC-stable algorithms across simulation studies.

  n     p    E(m)   Normal              Copula
                    nonPC   PC-stable   nonPC   PC-stable
  50    9    1.4    3.35    3.05        5.55    5.75
  100   27   2.0    14.55   11.00       25.6    28.6
  150   81   2.4    53.70   43.45       97.3    121.3
  200   243  2.8    186.2   183.4       331.00  471.45

  n     p    E(m)   Mixture             Nonlinear SEM
                    nonPC   PC-stable   nonPC   PC-stable
  50    9    1.4    3.8     3.5         2.9     3.7
  100   27   2.0    17.75   18.00       15.05   20.05
  150   81   2.4    69.05   77.75       62.583  95.083
  200   243  2.8    250.3   336.1       213.70  375.45
The results in Table 1 demonstrate that the nonPC performs nearly as well as the PC-stable for the Gaussian data example in terms of the average SHD. For each of the non-Gaussian data examples, however, the nonPC outperforms the PC-stable in estimating the true skeleton of the underlying DAGs, and the improvement in SHD becomes more substantial as the dimension grows. The superior performance of the nonPC over the PC-stable for the non-Gaussian graphical models is expected, as the characterization of conditional independence by partial correlations is valid only under joint Gaussianity (or a linear structural equation model).

4.2. Performance of the NonFCI Algorithm

In this subsection, we compare the performance of the nonFCI and FCI-stable algorithms over various simulated datasets. We first generate random DAGs as in Examples 1 and 2. To assess the impact of latent variables, we randomly designate half of the variables that have no parents and at least one child as latent; we do not consider selection variables. We run both the nonFCI and the FCI-stable algorithms on the above data examples for the settings reported in Table 2, using 199 bootstrap replicates for the CdCov-based conditional independence tests and 20 simulation runs for each data-generating model. Table 2 reports the average SHD between the estimated and true PAG skeletons for the nonFCI and FCI-stable algorithms.
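The rule described above for choosing latent variables (no parents, at least one child, half of the eligible nodes chosen at random) can be read off the DAG's adjacency matrix. The helper `mark_latent` below is an illustrative sketch, not code from the paper:

```python
import numpy as np

def mark_latent(adj, rng):
    """Randomly declare half of the eligible nodes latent.

    `adj` is the DAG adjacency matrix with adj[i, j] = 1 meaning an
    edge i -> j.  Eligible nodes have no parents (no incoming edge)
    and at least one child (some outgoing edge), mirroring the
    simulation design described above.  Returns the chosen indices.
    """
    adj = np.asarray(adj)
    no_parents = adj.sum(axis=0) == 0  # column sums count incoming edges
    has_child = adj.sum(axis=1) > 0    # row sums count outgoing edges
    eligible = np.flatnonzero(no_parents & has_child)
    k = len(eligible) // 2
    return np.sort(rng.choice(eligible, size=k, replace=False))

# Example: edges 0 -> 1 -> 2 and 4 -> 2; only nodes 0 and 4 are eligible,
# so exactly one of them is marked latent.
adj = np.zeros((5, 5), dtype=int)
adj[0, 1] = adj[1, 2] = adj[4, 2] = 1
latent = mark_latent(adj, np.random.default_rng(0))
```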
Table 2

Comparison of the average structural Hamming distances (SHD) of nonFCI and FCI-stable algorithms across simulation studies.

           Normal              Copula              Mixture             Nonlinear SEM
p    E(m)  nonFCI  FCI-stable  nonFCI  FCI-stable  nonFCI  FCI-stable  nonFCI  FCI-stable
10   2.0   7.15    7.60        1.3     1.8         5.65    6.80        7.15    8.20
20   2.0   14.55   17.60       4.55    6.85        13.65   18.55       19.0    20.8
30   2.0   27.65   33.95       5.25    10.15       19.3    27.8        33.40   37.85
100  3.0   109.30  150.35      26.95   60.05       62.25   111.10      115.2   149.0
200  3.0   287.75  371.40      76.733  157.267     136.05  255.10      289.6   354.1
The results in Table 2 demonstrate that, in both the Gaussian and non-Gaussian examples, the nonFCI algorithm outperforms the FCI-stable in estimating the true PAG skeleton.

4.3. Real Data Example

A major difficulty in assessing whether the nonPC and nonFCI provide more reasonable estimates than the parametric versions of the algorithms in high-dimensional real data settings is that the true causal graph is unknown in most cases. In the absence of ground truth, we can only draw conclusions about sensible causal mechanisms by examining known or logical relationships among pairs of variables. This becomes increasingly difficult for larger networks, where even visualization is challenging. We therefore first choose a relatively small dataset in Section 4.3.1, where we can draw upon background knowledge to glean insight into potential causal mechanisms in a setting where the data are clearly non-Gaussian. This example illustrates the main point of the paper: with non-Gaussian data (categorical, in this example), the nonPC is expected to perform better than the PC-stable in learning the true causal structure of the underlying DAG. In Section 4.3.2, we consider a larger example and examine the performance of the PC-stable and nonPC in learning the DAG from both seemingly Gaussian data and a categorized version of the same data. This example clearly illustrates a potential limitation of the PC-stable: in contrast to the nonPC, its output can be strikingly different when applied to a categorized version of the original data.

4.3.1. Montana Poll Dataset

To demonstrate the flexibility of our proposed framework, we first apply the nonPC algorithm to the Montana Economic Outlook Poll dataset. The poll was conducted in May 1992, when a random sample of 209 Montana residents were asked whether their personal financial status was worse, the same, or better than a year ago, and whether they thought the state economic outlook was better than the year before. Accompanying demographic information on the respondents' age, income, political orientation, and area of residence in the state was also recorded. We obtained the dataset from the Data and Story Library (DASL), available at https://math.tntech.edu/e-stat/DASL/page4.html (accessed on 25 March 2021). The study comprises the following seven categorical variables:

AGE: 1 = under 35, 2 = 35–54, 3 = 55 and over;
SEX: 0 = male, 1 = female;
INC (yearly income): 1 = under $20K, 2 = $20–35K, 3 = over $35K;
POL: 1 = Democrat, 2 = Independent, 3 = Republican;
AREA: 1 = Western, 2 = Northeastern, 3 = Southeastern Montana;
FIN (financial status): 1 = worse, 2 = same, 3 = better than a year ago;
STAT (state economic outlook): 1 = better, 0 = not better than a year ago.

After removing the cases with missing values, we retain the complete cases as our sample. Since all the variables are categorical, the Gaussianity assumption is clearly violated, and we would thus expect the nonPC to perform better than the PC-stable in learning the true causal structure among the variables. Figure 1 presents the CPDAGs estimated by the nonPC and PC-stable algorithms at a fixed significance level, using 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm.
Figure 1

CPDAGs estimated by the nonPC and PC-stable algorithms for the Montana poll dataset.

It is quite intuitive that age and sex are likely to affect income; one's financial status and area of residence might also influence political inclination; and improvements or downturns in the state economic outlook might impact an individual's financial status. The CPDAG estimated by the nonPC algorithm in Figure 1a affirms this common-sense understanding of the causal influences. In the CPDAG estimated by the PC-stable in Figure 1b, however, the edge between age and income is missing, and the directed edges POL → AREA and POL → FIN make little substantive sense in this case.

4.3.2. Protein Expression Data

We next consider a protein expression dataset of 410 breast cancer patients from The Cancer Genome Atlas (TCGA). The dataset consists of expression measurements on a set of genes, and we randomly select a subset of patients with PR-negative status. Since the true causal structure of the genes in cancer cells may differ from that of normal cells [26], we apply both the nonPC and PC-stable algorithms to learn the causal structure. To scrutinize the performance of the nonPC and PC-stable as the data depart farther from Gaussianity, we categorize the protein expression data for each of the p genes as follows: we compute the three sample quartiles of the protein expression values for every gene, and use them to obtain categorized protein expressions. We apply the nonPC and PC-stable algorithms to both the original and the categorized protein expression data at a fixed significance level, using 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm. Table 3 shows the SHD between the skeletons estimated from the original and the categorized data by the nonPC and PC-stable algorithms. The SHD for the PC-stable algorithm is much larger than that for the nonPC. This example highlights a potential limitation of parametric implementations of the PC algorithm: when the data deviate farther from Gaussianity (here, by being categorical), the estimates produced by the PC-stable may deviate considerably more from those obtained on the original data. In contrast, the nonparametric test in the nonPC delivers more stable estimates regardless of the data distribution.
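The quartile-based categorization can be sketched as follows. The exact coding rule is not reproduced above; as one natural choice (an assumption on our part), each value is binned by its gene's three sample quartiles into four ordinal categories:

```python
import numpy as np

def categorize_by_quartiles(x):
    """Map one gene's expression values to four ordinal categories.

    Values are binned by the gene's three sample quartiles
    q1 <= q2 <= q3 (an assumed coding; the paper's exact rule is
    not shown above), giving categories 1, 2, 3, 4.
    """
    x = np.asarray(x, dtype=float)
    quartiles = np.quantile(x, [0.25, 0.5, 0.75])
    return np.digitize(x, quartiles) + 1

print(categorize_by_quartiles(np.arange(1, 9)))  # -> [1 1 2 2 3 3 4 4]

# Applied column-wise to an n x p expression matrix X:
# X_cat = np.apply_along_axis(categorize_by_quartiles, 0, X)
```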
Table 3

Comparison of the SHD between the skeletons estimated from the original and the categorized protein expression data by the nonPC and PC-stable algorithms.

nonPC    PC-stable
22       79

5. Discussion

We proposed nonparametric variants of the popular PC-stable and FCI-stable algorithms that employ the conditional distance covariance (CdCov) to test for conditional independence relationships. Our proposed algorithms broaden the applicability of the PC/PC-stable and FCI/FCI-stable algorithms to general distributions over DAGs, and account for nonlinear and non-monotone conditional dependences among the random variables, which partial correlations fail to capture. We show that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to more general distributions over DAGs when CdCov-based nonparametric tests for conditional independence are used. These results are obtained without imposing strict distributional assumptions, requiring only moment and tail conditions on the variables. There are several intriguing directions for future research. First, it is generally difficult to select the tuning parameter (i.e., the significance threshold for the CdCov test) in causal structure learning. One possible strategy is to use ideas based on stability selection [27,28]. By assessing the stability of the estimated graphs across multiple subsamples, this strategy allows one to choose the tuning parameter so as to control the false positive error; however, the repeated subsampling increases the computational burden. Second, the computational and sample complexities of the PC and FCI algorithms (and hence those of the nonPC and nonFCI) scale with the maximum degree of the DAG, which is assumed to be small relative to the sample size. In many applications, however, one encounters sparse graphs containing a small number of highly connected 'hub' nodes. For such cases, ref. [29] proposed a low-complexity variant of the PC algorithm, the reduced PC (rPC) algorithm, which exploits the local separation property of large random networks [30].
The rPC is shown to consistently estimate the skeleton of a high-dimensional DAG by conditioning only on sets of small cardinality. More recently, ref. [31] have generalized this approach to account for unobserved confounders. In this light, it would be intriguing to develop computationally faster variants of the nonPC and nonFCI in the future by exploiting the idea of local separation.
