
Minimising the Kullback-Leibler Divergence for Model Selection in Distributed Nonlinear Systems.

Oliver M Cliff^1,2, Mikhail Prokopenko^2, Robert Fitch^1,3.

Abstract

The Kullback-Leibler (KL) divergence is a fundamental measure of information geometry that is used in a variety of contexts in artificial intelligence. We show that, when system dynamics are given by distributed nonlinear systems, this measure can be decomposed as a function of two information-theoretic measures, transfer entropy and stochastic interaction. More specifically, these measures are applicable when selecting a candidate model for a distributed system, where individual subsystems are coupled via latent variables and observed through a filter. We represent this model as a directed acyclic graph (DAG) that characterises the unidirectional coupling between subsystems. Standard approaches to structure learning are not applicable in this framework due to the hidden variables; however, we can exploit the properties of certain dynamical systems to formulate exact methods based on differential topology. We approach the problem by using reconstruction theorems to derive an analytical expression for the KL divergence of a candidate DAG from the observed dataset. Using this result, we present a scoring function based on transfer entropy to be used as a subroutine in a structure learning algorithm. We then demonstrate its use in recovering the structure of coupled Lorenz and Rössler systems.


Keywords: Kullback–Leibler divergence; complex networks; information theory; model selection; nonlinear systems; state space reconstruction; stochastic interaction; transfer entropy

Year: 2018 PMID: 33265171 PMCID: PMC7512642 DOI: 10.3390/e20020051

Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524

1. Introduction

Distributed information processing systems are commonly studied in complex systems and machine learning research. We are interested in inferring data-driven models of such systems, specifically in the case where each subsystem can be viewed as a nonlinear dynamical system. In this context, the Kullback–Leibler (KL) divergence is commonly used to measure the quality of a statistical model [1,2,3]. When a model is compared with fully observed data, computing the KL divergence can be straightforward. However, in the case of spatially distributed dynamical systems, where individual subsystems are coupled via latent variables and observed through a filter, the presence of hidden variables renders typical approaches unusable. We derive the KL divergence in such systems as a function of two information-theoretic measures using methods from differential topology. The model selection problem has applications in a wide variety of areas due to its usefulness in performing efficient inference and understanding the underlying phenomena being studied. Dynamical systems are an expressive model characterised by a map that describes their evolution over time and a read-out function through which we observe the latent state. Our research focuses on the more general case of a multivariate system, where a set of these subsystems are distributed and unidirectionally coupled to one another. The problem of inferring this coupling is an important multidisciplinary study in fields such as ecology [4], neuroscience [5,6], multi-agent systems [7,8,9], and various others that focus on artificial and biological networks [10]. We represent such a spatially distributed system as a probabilistic graphical model termed a synchronous graph dynamical system (GDS) [11,12], whose structure is given by a DAG. Model selection in this context is the problem of inferring directed relationships between hidden variables from an observed dataset, also known as structure learning. 
A main challenge in structure learning for DAGs is the case where variables are unobserved. Exact methods are known for fully observable systems (i.e., Bayesian networks (BNs)) [13]; however, these are not applicable in the more expressive case when the state variables in dynamical systems are latent. The main focus of this paper is to analytically derive a measure for comparing a candidate graph to the underlying graph that generated a measured dataset. Such a measure can then be used to solve the two subproblems that comprise structure learning, evaluation and identification [14], and hence find the optimal model that explains the data. For the evaluation problem, it is desirable to select the simplest model that incorporates all statistical knowledge. This concept is commonly expressed via information theory, where an established technique is to evaluate the encoding length of the data, given the model [1,15,16]. The simplest model should aim to minimise code length [2], and therefore we can simplify our problem to that of minimising KL divergence for the synchronous GDS. Using this measure, we find a factorised distribution (given by the graph structure) that is closest to the complete (unfactorised) distribution. We first analytically derive an expression for this divergence, and build on this result to present a scoring function for evaluating candidate graphs based on a dataset. The main result of this paper is an exact decomposition of the KL divergence for synchronous GDSs. We show that this measure can be decomposed as the difference between two well-known information-theoretic measures, stochastic interaction [17,18] and collective transfer entropy [19]. We establish this result by first representing discrete-time multivariate dynamical systems as dynamic Bayesian networks (DBNs) [20]. In this form, neither the complete nor the factorised distribution can be directly computed due to the hidden system state. 
Thus, we draw on state space reconstruction methods from differential topology to reformulate the KL divergence in terms of computable distributions. Using this expression, we show that the maximum transfer entropy graph is the most likely to have generated the data. This is experimentally validated using toy examples of a Lorenz–Rössler system and a network of coupled Lorenz attractors (Figure 1) of up to four nodes. These results support the conjecture that transfer entropy can be used to infer effective connectivity in complex networks.

Figure 1

Trajectory of a pair of coupled Lorenz systems. Top row: original state of the subsystems. Bottom row: time-series measurements of the subsystems. In each figure, the black lines represent an uncoupled simulation, and teal lines illustrate a simulation where the first (leftmost) subsystem was coupled to the second.

2. Related Work

Networks of coupled dynamical systems have been introduced under a variety of terms, such as complex networks [10], distributed dynamical systems [6] and master–slave configurations [21]. The defining feature of these networks is that the dynamics of each subsystem are given by a set of either discrete-time maps or first-order ordinary differential equations (ODEs). In this paper, we use the discrete-time formulation, where a map can be obtained numerically by integrating ODEs or recording observations at discrete-time intervals [22]. An important precursor to network reconstruction is inferring causality and coupling strength between complex nonlinear systems. Causal inference is intractable when the experimenter cannot intervene with the dataset [23], and so we focus our attention on methods that determine conditional independence (coupling) rather than causality. In seminal work, Granger [24] proposed Granger causality for quantifying the predictability of one variable from another; however, a key requirement of this measure is linearity of the system, implying subsystems are separable [4]. Schreiber [25] extended these ideas and introduced transfer entropy using the concept of finite-order Markov processes to quantify the information transfer between coupled nonlinear systems. Transfer entropy and Granger causality are equivalent for linearly-coupled Gaussian systems (e.g., Kalman models) [26]; however, there are clear distinctions between the concepts of information transfer and causal effect [27]. Although transfer entropy has received criticism over spuriously identifying causality [28,29,30], we are concerned with statistical modelling and not causality of the underlying process.
Recently, a number of measures have been proposed to infer coupling between distributed dynamical systems based on reconstruction theorems. Sugihara et al. [4] proposed convergent cross-mapping, which involves collecting a history of observed data from one subsystem and using it to predict the outcome of another subsystem. This history is the delay reconstruction map described by Takens' Delay Embedding Theorem [31]. Similarly, Schumacher et al. [6] used the Bundle Delay Embedding Theorem [32,33] to infer causality and perform inference via Gaussian processes. Although the algorithms presented in these papers can infer driving subsystems in a spatially distributed dynamical system, the results obtained differ from ours as inference is not considered for an entire network structure, nor is a formal derivation presented. Contrasting this, we recently derived an information criterion for learning the structure of distributed dynamical systems [12]. However, the criterion proposed required parametric modelling of the probability distributions, and thus a detailed understanding of the physical phenomena being studied. In this paper, we extend this framework by first showing that KL divergence can be decomposed into information-theoretically useful measures, and then arriving at a similar result but employing non-parametric density estimation techniques so that no assumptions about the underlying distributions are required. It is important to distinguish our approach from dynamic causal modelling (DCM), which attempts to infer the parameters of explicit dynamic models that cause (generate) data. In DCM, the set of potential models is specified a priori (typically in the form of ODEs) and then scored via marginal likelihood or evidence. The parameters of these models include effective connectivity such that their posterior estimates can be used to infer coupling among distributed dynamical systems [34]. As a consequence, these approaches can be used to recover networks that reveal the effective structure of observed systems [35,36]. 
In contrast, our approach does not require an explicitly specified model because the scoring function can be computed directly from the data. However, it does assume an implicit model in the form of a DAG where the subsystem processes are generated by generic functions. Unlike effective connectivity, which is defined in relation to a (dynamic causal) model, the concept of functional connectivity refers to recovering statistical dependencies [37]. Consequently, statistical measures such as Granger causality and transfer entropy are typically used to identify functional, rather than effective structure. For example, transfer entropy has been used previously to infer networks in numerous fields, e.g., computational neuroscience [5,38], multi-agent systems [8], financial markets [39], supply-chain networks [40], and biology [41]. However, most of these results build on the work of Schreiber [25] by assuming the system is composed of finite-order Markov chains and thus there is a dearth of work that provides formal derivations for the use of this measure for inferring effective connectivity. Our work allows us to compute scoring functions directly from multivariate time series (as in functional connectivity), yet still assumes an implicit model (albeit with weaker assumptions on the model than those considered in inferring effective connectivity).

3. Background

3.1. Notation

We use the convention that denotes a sequence, a set, and a vector. In this work, we consider a collection of stationary stochastic temporal processes . Each process comprises a sequence of random variables with realisation for countable time indices . Given these processes, we can compute probability distributions of each variable by counting relative frequencies or by density estimation techniques [42,43]. We use bold to denote the set of all variables, e.g., is the collection of M realisations at index n. Furthermore, unless otherwise stated, is a latent (hidden) variable, is an observed variable, and is an arbitrary variable; thus, is the set of all hidden and observed variables at temporal index n. Given a graphical model G, the parents of variable are given by the parent set . Finally, let the superscript denote the vector of k previous values taken by variable .

3.2. Representing Distributed Dynamical Systems as Probabilistic Graphical Models

We are interested in modelling discrete-time multivariate dynamical systems, where the state is a vector of real numbers given by a point lying on a compact d-dimensional manifold . A map describes the temporal evolution of the state at any given time, such that the state at the next time index . Furthermore, in many practical scenarios, we do not have access to directly, and can instead observe it through a measurement function that yields a scalar representation of the latent state [22,44]. We assume the multivariate system can be factorised and modelled as a DAG with spatially distributed dynamical subsystems, termed a synchronous GDS (see Figure 2a). This definition is restated from [12] as follows.

Figure 2

Representation of (a) the synchronous GDS with two vertices, and (b) the rolled-out DBN of the equivalent structure. The two subsystems are coupled by virtue of the edge between them.

A synchronous GDS The global dynamics and observations can therefore be described by the set of local functions [12]: where and are additive noise terms. The subsystem dynamics (1) are a function of the subsystem state and the subsystem parents’ state at the previous time index, i.e., . However, the observation is a function of the subsystem state alone, i.e., . We assume that the maps and , as well as the graph G, are time-invariant. The discrete-time mapping for the dynamics (1) and measurement functions (2) can be modelled as a DBN in order to facilitate structure learning of the graph [12] (see Figure 2b). DBNs are a probabilistic graphical model that represent probability distributions over trajectories of random variables using a prior BN and a two-time-slice BN (2TBN) [45]. To model the maps, however, we need only to consider the 2TBN , which can model a first-order Markov process graphically via a DAG G and a set of conditional probability distribution (CPD) parameters [45]. Given a set of stochastic processes , the realisation of which constitutes the sample path , the 2TBN distribution is given by where denotes the (index-ordered) set of realisations . To model the synchronous GDS as a DBN, we associate each subsystem vertex with a state variable and an observation variable . The parents of subsystem are denoted [12]. From the dynamics (1), variables in the set come strictly from the preceding time slice, and additionally, from the measurement function (2), . Thus, we can build the edge set in the GDS by means of the edges in the DBN [12], i.e., given an edge of the DBN, the equivalent edge exists for the GDS. The distributions for the dynamics (1) and observation (2) maps of M arbitrary subsystems can therefore be factorised according to the DBN structure such that [12] The goal of learning nonlinear dynamical networks thus becomes that of inferring the parent set for each latent variable . 
Finally, recall that the parents of each observation are constrained such that . As a consequence, we use the shorthand notation to denote the observation of the j-th parent of the i-th subsystem at time n (and the same for ).

3.3. Network Scoring Functions

A number of exact and approximate DBN structure learning algorithms exist that are based on Bayesian statistics and information theory. We have shown in prior work how to compute the log-likelihood function for synchronous GDSs. In this section, we will briefly summarise the problem of structure learning for DBNs, focusing on the factorised distribution (3). The score and search paradigm [46] is a common method for recovering graphical models from data. Given a dataset , the objective is to find a DAG such that where : is a scoring function measuring the degree of fitness of a candidate DAG G to the data set D, and is the set of all DAGs. Finding the optimal graph in Equation (4) requires solutions to the two subproblems that comprise structure learning: the evaluation problem and the identification problem [14]. The main problem we focus on in this paper is the evaluation problem, i.e., determining a score that quantifies the quality of a graph, given data. Later, we will address the identification problem by discussing the attributes of this scoring function in efficiently finding the optimal graph structure. In prior work, we developed a score based on the posterior probability of the network structure G, given data D. That is, we considered maximising the expected log-likelihood [12] where the expectation . It was shown that state space reconstruction techniques (see Appendix A) can be used to compute the log-likelihood of Equation (3) as a difference of conditional entropy terms [12]. In the same work, we illustrated that the log-likelihood ratio of a candidate DAG G to the empty network is given by collective transfer entropy (see Appendix B), i.e., For the nested log-likelihoods above, the statistics of asymptotically follow the -distribution, where q is the difference between the number of parameters of each model [47,48]. We will draw on this log-likelihood decomposition in later sections for statistical significance testing.
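As a minimal sketch of the score and search paradigm in Equation (4), the following code enumerates every DAG on a small vertex set and returns the one with the highest score. It assumes the score decomposes into a sum of per-vertex terms, and the function `local_score` and the toy scoring rule in the usage lines are hypothetical placeholders for illustration only; they are not the scoring function derived in this paper.

```python
from itertools import combinations, product
from typing import Callable, Dict, FrozenSet, Tuple


def is_acyclic(parents: Dict[int, FrozenSet[int]]) -> bool:
    """Kahn-style check: repeatedly remove vertices that have no remaining parents."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return False  # every remaining vertex has a parent, so a cycle exists
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True


def best_dag(
    num_vertices: int,
    local_score: Callable[[int, FrozenSet[int]], float],
) -> Tuple[Dict[int, FrozenSet[int]], float]:
    """Exhaustively search all DAGs, assuming the score is a sum of local terms."""
    vertices = range(num_vertices)
    options = []
    for v in vertices:
        others = [u for u in vertices if u != v]
        # Every subset of the other vertices is a candidate parent set of v.
        options.append([frozenset(c)
                        for r in range(len(others) + 1)
                        for c in combinations(others, r)])
    best, best_score = None, float("-inf")
    for choice in product(*options):
        parents = dict(enumerate(choice))
        if not is_acyclic(parents):
            continue
        score = sum(local_score(v, ps) for v, ps in parents.items())
        if score > best_score:
            best, best_score = parents, score
    return best, best_score


# Hypothetical toy score: reward edges of a known chain 0 -> 1 -> 2, penalise others.
TRUE_PARENTS = {0: frozenset(), 1: frozenset({0}), 2: frozenset({1})}


def toy_score(v: int, ps: FrozenSet[int]) -> float:
    return len(ps & TRUE_PARENTS[v]) - 0.5 * len(ps - TRUE_PARENTS[v])


graph, score = best_dag(3, toy_score)
```

Exhaustive enumeration is only feasible for a handful of vertices; it is used here because it makes the arg max in Equation (4) explicit.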

4. Computing Conditional KL Divergence

In this section, we present our main result, which is an analytical expression of KL divergence that facilitates structure learning in distributed nonlinear systems. We begin by considering the problem of finding an optimal DBN structure as searching for a parsimonious factorised distribution that best represents the complete digraph distribution . That is, is the joint distribution yielded by assuming no factorisation (the complete graph ) and thus no information loss. The distribution is expressed as: We quantify the similarity of the factorised distribution to this joint distribution via KL divergence. In prior work, De Campos [3] derived the MIT scoring function for BNs by this approach and it was later used for DBN structure learning with complete data [49]. We extend the analysis to DBNs with latent variables, i.e., we compare the joint and factorised distributions of time slices, given the entire history, Substituting the synchronous GDS model (3) into Equation (8), we get However, Equation (9) comprises maximum likelihood distributions with unobserved (latent) states . It is common in model selection to decompose the KL divergence as where the second term is simply the log-likelihood (5). In this form, is often identical for all models considered and, in practice, it suffices to ignore this term and thus avoid the problem of computing distributions of latent variables. The resulting simpler expression can be viewed as log-likelihood maximisation (as in our previous work outlined in Section 3.3). However, as we show in this section, is not equivalent for all models unless certain parameters of the dynamical systems are known. Hence, for now, we cannot ignore the first term of Equation (10) and we instead propose an alternative decomposition of KL divergence that comprises only observed variables.
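To make the idea of comparing a complete joint distribution with a factorised one concrete, the sketch below computes the KL divergence between a small discrete joint distribution and the product of its marginals. This is a static, fully observed toy example chosen for illustration; it is not the time-sliced, latent-variable setting of Equations (8) and (9), and the helper names are assumptions.

```python
import math
from typing import Dict, Tuple

Joint = Dict[Tuple[int, int], float]


def kl_divergence(p: Joint, q: Joint) -> float:
    """D_KL(p || q) in nats for distributions over the same outcomes."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def factorise(joint: Joint) -> Joint:
    """Product of the two marginals, i.e., the model with no dependence."""
    px: Dict[int, float] = {}
    py: Dict[int, float] = {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return {(x, y): px[x] * py[y] for x in px for y in py}


# Two binary variables that tend to agree with each other.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

loss_dependent = kl_divergence(dependent, factorise(dependent))
loss_independent = kl_divergence(independent, factorise(independent))
```

In this toy case the divergence equals the mutual information between the two variables: it is positive when the factorisation discards a real dependence and zero when the variables are independent.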

4.1. A Tractable Expression via Embedding Theory

In order to compute the distributions in (9), we use the Bundle Delay Embedding Theorem [32,33] to reformulate the factorised distribution (denominator), and the Delay Embedding Theorem for Multivariate Observation Functions [50] for the joint distribution (numerator). We describe these theorems in detail in Appendix A, along with the technical assumptions required for . Although the following theorems assume a diffeomorphism, we also discuss application of the theory towards inferring the structure of endomorphisms (e.g., coupled map lattices [51]) in the same appendix. The first step is to reproduce a prior result for computing the factorised distribution (denominator) in Equation (9). First, the embedding where is the (strictly positive) lag, and is the embedding dimension of the i-th subsystem (the embedding parameters). Note that, although we can take either the future or past delay embedding (11) for diffeomorphisms, we explicitly consider a history of values to account for both endomorphisms and diffeomorphisms. Moreover, an important assumption of our approach is that the structure (enforced by coupling between subsystems) is a DAG; this comes from the Bundle Delay Embedding Theorem [32,33] (see Lemma 1 of [12] for more detail). Our previous result is expressed as follows. Given an observed dataset D, where Next, we present a method for computing the joint distribution (numerator) in Lemma 3. For convenience, Lemma 2 restates part of the delay embedding theorem in [50] in terms of subsystems of a synchronous GDS and establishes existence of a map for predicting future observations from a history of observations. Consider a diffeomorphism on a d-dimensional manifold , where the multivariate state consists of M subsystem states . Each subsystem state is confined to a submanifold of dimension , where . The multivariate observation is given, for some map , by . 
The proof restates part of the proof of Theorem 2 of Deyle and Sugihara [50] in terms of subsystems. Given M inhomogeneous observation functions , the following map is an embedding where each subsystem (local) map , smoothly (at least ), and, at time index n is described by where [50]. Note that, from (13) and (14), we have the global map Now, since is an embedding, it follows that the map is well defined and a diffeomorphism between two observation sequences , i.e., The last components of are trivial, i.e., the set is observed; denote the first M components by , and then we have . ☐ We now use the result of Lemma 2 to obtain a computable form of the KL divergence. Consider a discrete-time multivariate dynamical system with generic Lemma 1, we can substitute (12) into (9), and express the KL divergence as We now focus on . Using the chain rule, Given the Markov property of the dynamics (1) and observation (2) maps, we get Now, recall from Lemma 2 that global equations for the entire system state and observation are Given the assumption of i.i.d. noise on the function f, from (18), we express the probability of the dynamics , given by the embedding, as By assumption, the observation noise is i.i.d. or dependent only on the state , and thus the probability of observing , from (19) is By (20) and (21), we have that Substituting Equation (22) into (17) gives Finally, substituting (23) back into (16) yields the statement of the theorem. ☐ Given all variables in (15) are observed, it is now straightforward to compute KL divergence; however, as we will see, it is more convenient to express (15) as a function of known information-theoretic measures.
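The history-based delay vectors used throughout this subsection can be sketched in a few lines. The function below builds, for every time index with a complete history, the vector of the current observation and its lagged predecessors. The names `delay_embed`, `kappa` (embedding dimension) and `tau` (lag) are chosen here for illustration and are assumptions rather than notation taken from the paper.

```python
from typing import List, Sequence


def delay_embed(y: Sequence[float], kappa: int, tau: int = 1) -> List[List[float]]:
    """Return history vectors [y[n], y[n - tau], ..., y[n - (kappa - 1) * tau]].

    One vector is produced for every index n that has a complete history.
    """
    if kappa < 1 or tau < 1:
        raise ValueError("kappa and tau must be positive integers")
    first = (kappa - 1) * tau  # earliest index with a complete history
    return [[y[n - j * tau] for j in range(kappa)] for n in range(first, len(y))]


series = [0.0, 1.0, 2.0, 3.0, 4.0]
vectors = delay_embed(series, kappa=3, tau=1)
```

Only past values are used, in line with the remark above that a history of values covers both endomorphisms and diffeomorphisms.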

4.2. Information-Theoretic Interpretation

The main theorem of this paper states KL divergence in terms of transfer entropy and stochastic interaction. These information-theoretic concepts are defined in Appendix B for convenience. Consider a discrete-time multivariate dynamical system with generic We can reformulate the KL divergence in (15) as Substituting in the definitions of transfer entropy (A8) and stochastic interaction (A9) completes the proof. ☐ To conclude this section, we present the following corollary showing that, when we assume a maximum or fixed embedding dimension and time delay , it suffices to maximise the collective transfer entropy alone in order to minimise KL divergence for a synchronous GDS. Fix an embedding dimension The first term of (24) is constant, given a constant vertex set , time delay and embedding dimension and is thus unaffected by the parent set of a variable. As a result, does not depend on the graph G being considered, and, therefore, we only need to consider transfer entropy when optimising KL divergence (24). ☐ As mentioned above, Corollary 1 is, in practice, equivalent to the maximum log-likelihood (5) and log-likelihood ratio (6) approaches. However, the statement only holds for constant embedding parameters. In the general case, where these parameters are unknown, one requires Theorem 4 to perform structure learning. Given this result, we can now confidently derive scoring functions from Corollary 1.

5. Application to Structure Learning

We now employ the results above in selecting a synchronous GDS that best fits data generated by a multivariate dynamical system. The most natural way to find an optimal model based on Theorem 4 is to minimise KL divergence. Here, we assume constant embedding parameters and use Corollary 1 to present the transfer entropy score and discuss some attributes of this score. We then use this scoring function as a subroutine for learning the structure of coupled Lorenz and Rössler attractors. From Corollary 1, a naive scoring function can be defined as Given parameterised probability distributions, this score is insufficient, since the sum of transfer entropy in (27) is non-decreasing when including more parents in the graph [38]. Thus, we use statistical significance tests in our scoring functions to mitigate this issue.

5.1. Penalising Transfer Entropy by Independence Tests

Building on the maximum likelihood score (27), we propose using independence tests to define two new scores of practical value. Here, we draw on the result of de Campos [3], who derived a scoring function for BN structure learning based on conditional mutual information and statistical significance tests, called MIT. The central idea is to use collective transfer entropy to measure the degree of interaction between each subsystem and its parent subsystems , but also to penalise this term with a value based on significance testing. As with the MIT score, this gives a principled way to re-scale the transfer entropy when including more edges in the graph. To develop our scores, we form a null hypothesis that there is no interaction , and then compute a test statistic to penalise the measured transfer entropy. To compute the test statistic, it is necessary to consider the measurement distribution in the case where the hypothesis is true. Unfortunately, this distribution is only analytically tractable in the case of discrete and linear-Gaussian systems, where is known to asymptotically approach the -distribution [48]. Since this distribution is a function of the parents of , we let it be described by the function . Now, given this distribution, we can fix some confidence level and determine the value such that . This represents a conditional independence test: if , then we accept the hypothesis of conditional independence between and ; otherwise, we reject it. We express this idea as the TEA score: In general, we only have access to continuous measurements of dynamical systems, and so are limited by the discrete or linear-Gaussian assumption. We can, however, use surrogate measurements to empirically compute the distribution under the assumption of [52]. This same technique has been used by [38] to derive a greedy structure learning algorithm for effective network analysis. 
Here, are surrogate sets of variables for , which have the same statistical properties as , but the correlation between and is removed. Let the distribution of these surrogate measurements be represented by some general function where, for the discrete and linear-Gaussian systems, we could compute analytically as an independent set of -distributions . When no analytic distribution is known, we use a resampling method (i.e., permutation or bootstrapping), creating a large number of surrogate time-series pairs by shuffling (for permutations, or redrawing for bootstrapping) the samples of and computing a population of . As with the TEA score, we fix some confidence level and determine the value , such that . This results in the TEE scoring function as We can obtain the value by (1) drawing S samples from the distribution (by permutation or bootstrapping), (2) fixing , and then (3) taking such that We can alternatively limit the number of surrogates S to and take the maximum as [22]; however, taking a larger number of surrogates will improve the validity of the distribution . Both the analytical (TEA) and empirical (TEE) scoring functions are illustrated in Figure 3. Note that the approach of significance testing is functionally equivalent to considering the log-likelihood ratio in (6), where, as stated, nested log-likelihoods (and thus transfer entropy) follow the above -distribution [48].
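The surrogate-based penalty described above can be sketched for a single source and target as follows. This is a simplified illustration under several assumptions: the data are discrete, the history length is one, transfer entropy is estimated by a plug-in (relative frequency) estimator rather than the KSG estimator, and the penalty is taken as the empirical (1 - alpha) quantile of the permutation surrogates. The function names and default arguments are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Sequence


def transfer_entropy(source: Sequence[int], target: Sequence[int]) -> float:
    """Plug-in transfer entropy (nats) from source to target, history length 1."""
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    pairs_yy = Counter(zip(target[1:], target[:-1]))
    pairs_yx = Counter(zip(target[:-1], source[:-1]))
    singles = Counter(target[:-1])
    n = len(target) - 1
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_full = count / pairs_yx[(y0, x0)]            # p(y1 | y0, x0)
        p_reduced = pairs_yy[(y1, y0)] / singles[y0]   # p(y1 | y0)
        te += (count / n) * math.log(p_full / p_reduced)
    return te


def penalised_score(source: Sequence[int], target: Sequence[int],
                    num_surrogates: int = 200, alpha: float = 0.05,
                    seed: int = 0) -> float:
    """Measured transfer entropy minus a permutation-surrogate threshold."""
    rng = random.Random(seed)
    measured = transfer_entropy(source, target)
    null = []
    for _ in range(num_surrogates):
        shuffled = list(source)
        rng.shuffle(shuffled)  # removes the source-target relationship
        null.append(transfer_entropy(shuffled, target))
    null.sort()
    index = min(num_surrogates - 1, math.ceil((1 - alpha) * num_surrogates) - 1)
    return measured - null[index]


# Toy data: the target copies the source with a delay of one step.
data_rng = random.Random(1)
x = [data_rng.randint(0, 1) for _ in range(500)]
y = [0] + x[:-1]

forward = penalised_score(x, y)   # x drives y
backward = penalised_score(y, x)  # no influence in this direction
```

A positive penalised value indicates that the measured transfer entropy exceeds what the surrogates produce by chance at the chosen confidence level; in this toy example the forward direction scores far higher than the backward one.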

Figure 3

Distributions of the (a) TEA penalty function (28) and the (b) TEE penalty function (29). Both distributions were generated by observing the outcome of 1000 samples from two correlated Gaussian variables. The figures illustrate: the distribution as a set of 100 sampled points (black dots); the area considered independent (grey regions); the measured transfer entropy (black line); and the difference between measurement and penalty term (dark grey region). Both tests use the same confidence level. The distribution in (a) was estimated by assuming variables were linearly-coupled Gaussians, and the distribution in (b) was computed via a kernel box method (computed by the Java Information Dynamics Toolkit (JIDT), see [52] for details).

5.2. Implementation Details and Algorithm Analysis

The two main implementation challenges that arise when performing structure learning are: (1) computing the score for every candidate network and (2) obtaining a sufficient number of samples to recover the network. The main contributions of this work are theoretical justifications for measures already in use and, fortunately, algorithmic performance has already been addressed extensively using various heuristics. Here, we present an exact, exhaustive implementation for the purpose of validating our theoretical contributions. First, for computing collective transfer entropy for the score (29), we require CPDs to be estimated from data. Given these CPDs, collective transfer entropy (A8) decomposes as a sum of p conditional transfer entropy (A7) terms, where is the size of the parent set (see Appendix B for details). Since most observations of dynamical systems are expected to be continuous, we employ a non-parametric, nearest-neighbour based approach to density estimation called the Kraskov–Stögbauer–Grassberger (KSG) estimator [43]. For any arbitrary decomposition of collective transfer entropy (i.e., any ordering of the parent set), this density estimation can be computed in time , where K is the number of nearest neighbours for each observation in a dataset of size N, and is the embedding dimension [52]. We upper bound this as since the maximum p is . Now, the above density estimation was described for an arbitrary ordering of the parent set. In the case of parametric (discrete or linear-Gaussian) density estimation, every permutation of the parent set yields equivalent results, with potentially different values for each permutation [3]; however, this is not the case for non-parametric density estimation techniques, e.g., the KSG estimator. Hence, as a conservative estimate of the score, we compute all permutations of the parent set and take the minimum collective transfer entropy. 
In order to obtain the surrogate distribution, we require S uncorrelated samples of the density. Since the surrogate distributions decompose in a similar manner, the score for a candidate network can be computed in time , where, again, we have upper bounded as . Using this approach, we can now compute the score (29), and thus the optimal graph can be found using any search procedure over DAGs. Exhaustive search, where all DAGs are enumerated, is typically intractable because the search space is super-exponential in the number of variables (about ), and so heuristics are often applied for efficiency. We restrict our attention to a relatively small network (a maximum of nodes) and thus we are able to employ the dynamic programming (DP) approach of Silander and Myllymäki [53] to search through the space of all DAGs efficiently. This approach requires first computing the scores for all local parent sets, i.e., scores. Once each score is calculated, the DP algorithm runs in time and the entire search procedure runs in time . As a consequence, the time complexity of the exhaustive algorithm is dominated by computing the scores and, in smaller networks, most of the time is spent on density estimation for surrogate distributions. Finally, the problem of inferring optimal embedding parameters is well studied in the literature. In our experimental evaluation, we set the embedding dimension to the maximum, i.e., , where d is the dimensionality of the entire latent state space (e.g., if and for each subsystem, then ). However, determining these parameters would give more insight into the system and reduce the number of samples required for inference. There are numerous criteria for optimising these parameters (e.g., [54]); most notably, the work of [55] suggests an information-theoretic approach that could be integrated into the scoring function (29) to search over the embedding parameters and DAG space simultaneously.
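To give a sense of the search-space sizes discussed above, the sketch below counts the labelled DAGs on n vertices using Robinson's recurrence and, separately, the number of (vertex, parent set) pairs that a local-score table would contain, since each vertex may take any subset of the remaining vertices as parents. The function names are illustrative assumptions and are not part of any cited implementation.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labelled DAGs on n vertices (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


def count_local_scores(n: int) -> int:
    """Number of (vertex, parent set) pairs: n vertices, 2^(n-1) subsets each."""
    return n * 2 ** (n - 1)


sizes = [(n, count_dags(n), count_local_scores(n)) for n in range(1, 5)]
```

For four vertices this gives 543 DAGs but only 32 local scores, which illustrates why precomputing local scores and then searching by dynamic programming is attractive for small networks.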

6. Experimental Validation

The dynamics (1) and observation (2) maps can be obtained from differential equations, discrete-time maps, or real-world measurements. To validate our approach, we use the toy example of distributed flows, whereby the dynamics of each node are given by either the Lorenz [56] or the Rössler system of ODEs [57]. The discrete-time measurements are obtained by integrating these ODEs over constant intervals. In this section, we formally introduce this model, study the effect of changing the parameters of a coupled Lorenz–Rössler system, and finally apply our scoring function to learn the structure of up to four coupled Lorenz attractors with arbitrary graph topology. To compute the scores, we use the Java Information Dynamics Toolkit (JIDT) [52], which includes both the KSG estimator and methods for generating the surrogate distributions.

6.1. Distributed Lorenz and Rössler Attractors

For validating our scoring function, we study coupled Lorenz and Rössler attractors. The Lorenz attractor exhibits chaotic solutions for certain parameter values and has been used to describe numerous phenomena of practical interest [56,58,59]. Each Lorenz system comprises three components (), which we denote ; the state dynamics are given by: with free parameters . Similarly, the Rössler attractor has state dynamics given by: with free parameters [57]. In the distributed case, the components of each state vector are also driven by components of another subsystem. A number of different schemes have been proposed for coupling these variables, e.g., using the product [21,60] and the difference [61,62] of components. Our model uses the latter approach of linear differencing between one or more subsystem variables to couple the network. Let denote the coupling strength, C denote a three-dimensional vector of binary values, and A denote an adjacency (coupling) matrix (i.e., an matrix of zeros with iff ). Then, the state equations for M spatially distributed systems can be expressed as where represents the i-th chaotic attractor and is additive noise. In our simulations, we use , (each subsystem is coupled via variable u), and the adjacency matrices shown in Figure 4. In our experiments, we use common parameters for both attractors, i.e., and . For the observation , it is common to use one component of the state as the read-out function [4,32,33]; we therefore let . The noise terms are normally distributed with and . Figure 1 illustrates example trajectories of Lorenz–Lorenz attractors coupled via this model.
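As an illustration of the coupling scheme described above, the following sketch integrates a small network of Lorenz systems coupled by linear differencing. Several details are assumptions rather than the paper's settings, because the corresponding symbols and values are not recoverable here: the classical Lorenz parameters (10, 28, 8/3), the convention that `adj[i][j] == 1` means subsystem j drives subsystem i, the Euler-Maruyama step size and noise scaling, and the use of the first component as the read-out. All function and variable names are hypothetical.

```python
import random
from typing import List, Sequence

State = List[float]


def lorenz(x: Sequence[float], sigma: float = 10.0, rho: float = 28.0,
           beta: float = 8.0 / 3.0) -> State:
    """Standard Lorenz vector field for one subsystem x = (u, v, w)."""
    u, v, w = x
    return [sigma * (v - u), u * (rho - w) - v, u * v - beta * w]


def simulate(adj: Sequence[Sequence[int]], coupling: float, mask: Sequence[int],
             steps: int, dt: float = 0.002, noise_sd: float = 0.0,
             seed: int = 0) -> List[List[State]]:
    """Euler-Maruyama integration of difference-coupled Lorenz systems.

    adj[i][j] == 1 is taken to mean that subsystem j drives subsystem i
    (an assumed convention); mask selects which components are coupled.
    """
    rng = random.Random(seed)
    m = len(adj)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(m)]
    traj = [[row[:] for row in x]]
    for _ in range(steps):
        nxt = []
        for i in range(m):
            drift = lorenz(x[i])
            for c in range(3):
                # Linear-difference coupling from every parent j of node i.
                pull = sum(adj[i][j] * (x[j][c] - x[i][c]) for j in range(m))
                drift[c] += coupling * mask[c] * pull
            nxt.append([x[i][c] + dt * drift[c]
                        + noise_sd * dt ** 0.5 * rng.gauss(0.0, 1.0)
                        for c in range(3)])
        x = nxt
        traj.append([row[:] for row in x])
    return traj


# Chain 0 -> 1 -> 2, coupled through the first component (u) only.
chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
traj = simulate(chain, coupling=5.0, mask=[1, 0, 0], steps=2000, noise_sd=0.1)
# Scalar read-out per subsystem: here the first state component (assumed).
observations = [[state[0] for state in snapshot] for snapshot in traj]
```

A fixed-step Euler scheme is used only to keep the sketch dependency-free; a higher-order integrator would be preferable for quantitative work.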

Figure 4

The network topologies used in this paper. The top row (a–d) shows four arbitrary networks with three nodes (M = 3) and the bottom row (e–h) shows four arbitrary networks with four nodes (M = 4).

6.2. Case Study: Coupled Lorenz–Rössler System

In order to characterise the effect of coupling on our score, we begin our evaluation by measuring the transfer entropy of a coupled Lorenz–Rössler attractor. In this setup, , , and , was given by (30), and was given by (31). The transfer entropy was computed with a finite sample size of . Figure 5 shows the transfer entropy as a function of numerous parameters. In particular, the figure illustrates the effect of varying the coupling strength, embedding dimension, dynamics noise, and observation noise. As expected, increasing the coupling strength, or reducing either noise level, increases the transfer entropy. As a function of the embedding dimension, however, the transfer entropy increases to a set point, remains approximately constant, and then decreases. The embedding dimension above which transfer entropy remains constant indicates the dimension at which the dynamics are reconstructed; the decrease in transfer entropy after this point, however, is likely due to the finite sample size used for density estimation.

Figure 5

Transfer entropy as a function of the parameters of a coupled Lorenz–Rössler system. The parameters varied are: coupling strength and embedding dimension in the top row (a–c); coupling strength and observation noise in the middle row (d–f); and observation noise and embedding dimension in the bottom row (g–i).

There are two interesting features in Figure 5 due to the dynamical systems studied. First, in the bottom row (Figure 5g–i), there is a bifurcation around . The theoretical embedding dimension for this system is , and, in this case, for , the embedding does not suffice to reconstruct the dynamics. Second, in Figure 5i, the transfer entropy decreases after about . This appears to be the case of synchrony due to strong coupling, where the dynamics of the forced variable become subordinate to the forcing [4], thus reducing the information transferred between the two subsystems.
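The embedding dimension discussed above controls how many lagged observations form each reconstructed state. A minimal sketch of a Takens-style delay embedding follows; the function name and the ordering of the lags are illustrative choices, not taken from the paper or from JIDT.

```python
from typing import List, Sequence


def delay_embed(series: Sequence[float], dim: int, delay: int = 1) -> List[List[float]]:
    """Return delay vectors [y_t, y_{t-delay}, ..., y_{t-(dim-1)*delay}]."""
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    start = (dim - 1) * delay  # first index with a full history available
    return [[series[t - k * delay] for k in range(dim)]
            for t in range(start, len(series))]


vectors = delay_embed([0.0, 1.0, 2.0, 3.0, 4.0], dim=3, delay=1)
print(vectors)  # [[2.0, 1.0, 0.0], [3.0, 2.0, 1.0], [4.0, 3.0, 2.0]]
```

Each additional embedding dimension removes samples from the start of the series and increases the dimensionality of the density being estimated, which is consistent with the finite-sample degradation noted above.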

6.3. Case Study: Network of Lorenz Attractors

In this section, we evaluate the score (27) in learning the structure of distributed dynamical systems. We look at systems of three and four nodes of coupled Lorenz subsystems with arbitrary topologies. Unfortunately, significantly larger numbers of nodes become computationally expensive due to the increased embedding dimension, number of data points N, and number of permutations required to calculate the collective transfer entropy. To evaluate the performance of the score (27), the dynamics noise is constant , whereas the observation noise and the number of observations taken N are varied. We selected the theoretical maximum embedding dimension and as is common given discrete-time measurements [22]. It should be noted from the results of Section 6.2 that transfer entropy is sensitive to the numerous parameters used to generate the data, and thus, depending on the scenario, a significant sample size can be required for recovering the underlying graph structure. We do not make an effort to reduce this sample size and instead show the effect of using a different number of samples on the accuracy of the structure learning procedure. In order to evaluate the scoring function, we compute the recall (R, or true positive rate), fallout (F, or false positive rate), and precision (P, or positive predictive value) of the recovered graph. Let TP denote the number of true positives (correct edges); TN denote the number of true negatives (correctly rejected edges); FP denote the number of false positives (incorrect edges); and FN denote the number of false negatives (incorrectly rejected edges). Then, R = TP/(TP + FN), F = FP/(FP + TN), and P = TP/(TP + FP). Finally, the F1-score gives the harmonic mean of precision and recall to give a measure of the test's accuracy, i.e., F1 = 2PR/(P + R). Note that the ideal recall, precision and F1-score is 1, and ideal fallout is 0. Furthermore, a ratio of R/F > 1 suggests the classifier is better than random. 
As a summary statistic, Table 1 and Table 2 present the F1-scores for all networks illustrated in Figure 4, and the full classification results (e.g., precision, recall, and fallout) are given in Appendix C. The F1-scores are thus a measure of how relevant the recovered network is to the original (generating) network from our data-driven approach.
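The classification measures defined above can be computed directly from a pair of adjacency matrices, as in the sketch below. Counting over all M × M entries, including the diagonal, is an assumption; it is chosen because it reproduces fallout values such as 0.14 = 1/7 that appear in the appendix tables. The example graphs and function name are hypothetical.

```python
from typing import Dict, Optional, Sequence

Matrix = Sequence[Sequence[int]]


def edge_metrics(true_adj: Matrix, found_adj: Matrix) -> Dict[str, Optional[float]]:
    """Recall, fallout, precision and F1 over all entries of two 0/1 matrices."""
    pairs = [(t, f) for t_row, f_row in zip(true_adj, found_adj)
             for t, f in zip(t_row, f_row)]
    tp = sum(1 for t, f in pairs if t and f)
    fp = sum(1 for t, f in pairs if not t and f)
    fn = sum(1 for t, f in pairs if t and not f)
    tn = sum(1 for t, f in pairs if not t and not f)

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None  # None marks an undefined value

    recall = ratio(tp, tp + fn)
    fallout = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    f1 = None
    if recall is not None and precision is not None and recall + precision > 0:
        f1 = 2 * precision * recall / (precision + recall)
    return {"R": recall, "F": fallout, "P": precision, "F1": f1}


# Hypothetical example: two true edges, both recovered, plus one spurious edge.
truth = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
found = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(edge_metrics(truth, found))
```

For an empty generating network, recall and the F1-score are undefined (returned as `None`), which corresponds to the dashes shown for the edgeless graphs in the tables.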

Table 1

F1-scores for three-node (M = 3) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4b–d (network G1 has no edges and thus an undefined F1-score). The p-value of the TEE score is given in the top row of each table, with ∞ signifying that no significance testing is used, i.e., score (27).

		p=∞		p=0.01		p=0.001		p=0.0001
Graph	N	σψ=1	σψ=10	σψ=1	σψ=10	σψ=1	σψ=10	σψ=1	σψ=10
G2	5 K	0.8	0.5	0.8	0.5	0.8	0.5	0.8	0.5
	25 K	1	0.8	1	0.5	1	0.5	1	0.8
	100 K	1	0.5	1	1	1	1	1	0.8
G3	5 K	1	0.67	1	1	1	1	1	0.67
	25 K	1	1	1	0.5	1	1	1	1
	100 K	1	1	1	1	1	1	1	1
G4	5 K	0.8	-	0.8	0.8	0.8	0.5	0.8	-
	25 K	1	1	1	1	1	0.5	1	1
	100 K	1	1	1	1	1	1	1	1

Table 2

F1-scores for four-node (M = 4) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4f–h (network G5 has no edges and thus an undefined F1-score). The p-value of the TEE score is given in the top row of each table, with ∞ signifying that no significance testing is used, i.e., score (27).

		p=∞		p=0.01		p=0.001		p=0.0001
Graph	N	σψ=1	σψ=10	σψ=1	σψ=10	σψ=1	σψ=10	σψ=1	σψ=10
G6	5 K	0.57	0.5	0.57	0.29	0.57	0.29	0.57	-
	25 K	0.75	0.33	0.75	0.33	0.75	0.29	0.75	0.33
	100 K	1	0.33	1	0.57	1	0.4	1	0.33
G7	5 K	1	0.25	1	0.29	0.75	0.25	0.75	0.57
	25 K	1	0.5	1	0.86	1	0.86	1	0.5
	100 K	1	0.86	1	0.86	1	0.86	1	0.86
G8	5 K	1	0.25	1	0.57	1	0.75	1	0.25
	25 K	1	0.86	1	0.86	1	0.86	1	0.86
	100 K	1	0.86	1	0.86	1	0.57	1	0.86

In general, the results of Table 1 and Table 2 show that the scoring function is capable of recovering the network with high precision and recall, as well as low fallout. In the table, the cell colours are shaded to indicate higher (white) to lower (black) scores. The best performing score is that with a p-value of , and no penalisation (a p-value of ∞) has the second-highest classification results. As expected, the graphs recovered from data with low observational noise (σψ = 1) are more accurate than those inferred from noisier data (σψ = 10). The results for three-node networks (shown in Table 1) yield mostly full recovery of the structure for a higher number of observations 75 K, whereas the four-node networks (shown in Table 2) are more difficult to classify. Interestingly, the statistical significance testing does not have a strong effect on the results. It is unclear if this is due to the use of the non-parametric density estimators, which, in effect, are parsimonious in nature since transfer entropy will likely reduce when conditioning on more variables with a fixed sample size. One challenging case is the empty networks G1 and G5; this is shown in Appendix C, where the fallout is rarely 0 for any of the p-values or sample sizes (although a large number of observations K appears to reduce spurious edges). It would be expected that significance testing on these networks would outperform the naive score (27) given that a non-zero bias is introduced for a finite number of observations. Further investigation is required to understand why the null case fails.

7. Discussion and Future Work

We have presented a principled method to compute the KL divergence for model selection in distributed dynamical systems based on concepts from differential topology. The results presented in Figure 5 and Table 1 and Table 2 illustrate that this approach is suitable for recovering synchronous GDSs from data. Further, KL divergence is related to model encoding, which is a fundamental measure used in complex systems analysis. Our result, therefore, has potential implications for other areas of research. For example, the notion of equivalence classes in BN structure learning [63] should lend insight into the area of effective network analysis [35,36]. More specifically, the approach proposed here complements explicit Bayesian identification and comparison of state space models. In DCM, and more generally in approximate Bayesian inference, models are identified in terms of their parameters via an optimisation of an approximate posterior density over model parameters with respect to a variational (free energy) bound on log evidence [64]. After these parameters have been identified, this bound can be used directly for model comparison and selection. Interestingly, free energy is derived from the KL divergence between the approximate and true posterior and thus automatically penalises more complex models; however, in Equation (8), these distributions are inverted. In future work, it would be interesting to explore the relationship between transfer entropy and the variational free energy bound. Specifically, computing an evidence bound directly from the transfer entropy may allow us to avoid the significance testing described in Section 5 and instead use an approximation to evidence for structure learning. Multivariate extensions to transfer entropy are known to eliminate redundant pairwise relationships and take into account the influence of confounding relationships in a network (i.e., synergistic effects) [65,66]. 
In this work, we have shown that this intuition holds for distributed dynamical systems when confined to a DAG topology. We conjecture that these methods are also applicable when cyclic dependencies exist within a graph, given any generic observation can be used in reconstructing the dynamics [50]; however, the methods presented are more likely to reveal one source in the cycle, rather than all information sources, due to redundancy. There are a number of extensions that should be considered for further practical implementations of this algorithm. Currently, we assume that the dimensionality of each subsystem is known, and thus we can bound the embedding dimension for recovering the hidden structure. However, this is generally infeasible in practice and a more general algorithm would infer the embedding dimension and time delay for an unknown system. Fortunately, there are numerous techniques to recover these parameters [54,55]. Furthermore, evaluating the quality of large graphs is infeasible with our current approach. However, our exact algorithm illustrates the feasibility of state space reconstruction in recovering a graph in practice. In the future, we aim to leverage the structure learning literature on reducing the search space and approximating scoring functions to produce more efficient algorithms. Finally, the theoretical results of this work supplement understanding in fields where transfer entropy is commonly employed. Point processes are being increasingly viewed as models for a variety of information processing systems, e.g., as spiking neural trains [67] and adversaries in robotic patrolling models [68]. It was recently shown how transfer entropy can be computed for continuous time point processes such as these [67], allowing for efficient use of our analytical scoring function in a number of contexts. 
Another intriguing line of research is the physical and thermodynamic interpretation of transfer entropy [69], particularly its relationship to the arrow of time [70]; this relationship between endomorphisms as discussed here and time asymmetry of thermodynamics should be explored further.

Table A1

Classification results for three-node (M = 3) networks for 5000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G1	R	-	-	-	-	-	-	-	-
	F	0.33	0.22	0.33	0.22	0.22	0.33	0.33	0.22
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G2	R	1	0.5	1	0.5	1	0.5	1	0.5
	F	0.14	0.14	0.14	0.14	0.14	0.14	0.14	0.14
	P	0.67	0.5	0.67	0.5	0.67	0.5	0.67	0.5
	F1	0.8	0.5	0.8	0.5	0.8	0.5	0.8	0.5
G3	R	1	0.5	1	1	1	1	1	0.5
	F	0	0	0	0	0	0	0	0
	P	1	1	1	1	1	1	1	1
	F1	1	0.67	1	1	1	1	1	0.67
G4	R	1	0	1	1	1	0.5	1	0
	F	0.14	0.43	0.14	0.14	0.14	0.14	0.14	0.43
	P	0.67	0	0.67	0.67	0.67	0.5	0.67	0
	F1	0.8	-	0.8	0.8	0.8	0.5	0.8	-

Table A2

Classification results for four-node (M = 4) networks for 5000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G5	R	-	-	-	-	-	-	-	-
	F	0.31	0.25	0.31	0.19	0.31	0.25	0.31	0.19
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G6	R	0.67	0.67	0.67	0.33	0.67	0.33	0.67	0
	F	0.15	0.23	0.15	0.23	0.15	0.23	0.15	0.31
	P	0.5	0.4	0.5	0.25	0.5	0.25	0.5	0
	F1	0.57	0.5	0.57	0.29	0.57	0.29	0.57	-
G7	R	1	0.25	1	0.25	0.75	0.25	0.75	0.5
	F	0	0.25	0	0.17	0.083	0.25	0.083	0.083
	P	1	0.25	1	0.33	0.75	0.25	0.75	0.67
	F1	1	0.25	1	0.29	0.75	0.25	0.75	0.57
G8	R	1	0.25	1	0.5	1	0.75	1	0.25
	F	0	0.25	0	0.083	0	0.083	0	0.25
	P	1	0.25	1	0.67	1	0.75	1	0.25
	F1	1	0.25	1	0.57	1	0.75	1	0.25

Table A3

Classification results for three-node (M = 3) networks for 10,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G1	R	-	-	-	-	-	-	-	-
	F	0.22	0.11	0.22	0.11	0.22	0.22	0.22	0.11
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G2	R	1	0.5	1	0.5	1	0.5	1	0.5
	F	0	0.14	0	0.14	0	0.14	0	0.14
	P	1	0.5	1	0.5	1	0.5	1	0.5
	F1	1	0.5	1	0.5	1	0.5	1	0.5
G3	R	1	0.5	1	1	1	0	1	0.5
	F	0	0.14	0	0	0	0.29	0	0.14
	P	1	0.5	1	1	1	0	1	0.5
	F1	1	0.5	1	1	1	-	1	0.5
G4	R	1	1	1	0.5	1	0.5	1	1
	F	0.14	0.14	0	0	0.14	0.14	0.14	0.14
	P	0.67	0.67	1	1	0.67	0.5	0.67	0.67
	F1	0.8	0.8	1	0.67	0.8	0.5	0.8	0.8

Table A4

Classification results for four-node (M = 4) networks for 10,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G5	R	-	-	-	-	-	-	-	-
	F	0.31	0.25	0.31	0.19	0.31	0.19	0.31	0.25
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G6	R	0.67	0.33	0.67	0	1	1	0.67	0.33
	F	0.15	0.15	0.15	0.15	0.15	0.15	0.15	0.15
	P	0.5	0.33	0.5	0	0.6	0.6	0.5	0.33
	F1	0.57	0.33	0.57	-	0.75	0.75	0.57	0.33
G7	R	0.75	0.5	1	0.5	1	0.25	0.75	0.5
	F	0.083	0.083	0	0.083	0	0.17	0.083	0.083
	P	0.75	0.67	1	0.67	1	0.33	0.75	0.67
	F1	0.75	0.57	1	0.57	1	0.29	0.75	0.57
G8	R	1	0.25	1	0.25	1	0	1	0.25
	F	0	0.17	0	0.17	0	0.25	0	0.17
	P	1	0.33	1	0.33	1	0	1	0.33
	F1	1	0.29	1	0.29	1	-	1	0.29

Table A5

Classification results for three-node (M = 3) networks for 25,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G1	R	-	-	-	-	-	-	-	-
	F	0.22	0.11	0.22	0.11	0.22	0.22	0.22	0.11
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G2	R	1	1	1	0.5	1	0.5	1	1
	F	0	0.14	0	0.14	0	0.14	0	0.14
	P	1	0.67	1	0.5	1	0.5	1	0.67
	F1	1	0.8	1	0.5	1	0.5	1	0.8
G3	R	1	1	1	0.5	1	1	1	1
	F	0	0	0	0.14	0	0	0	0
	P	1	1	1	0.5	1	1	1	1
	F1	1	1	1	0.5	1	1	1	1
G4	R	1	1	1	1	1	0.5	1	1
	F	0	0	0	0	0	0.14	0	0
	P	1	1	1	1	1	0.5	1	1
	F1	1	1	1	1	1	0.5	1	1

Table A6

Classification results for four-node (M = 4) networks for 25,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G5	R	-	-	-	-	-	-	-	-
	F	0.31	0.19	0.31	0.19	0.31	0.19	0.31	0.19
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G6	R	1	0.33	1	0.33	1	0.33	1	0.33
	F	0.15	0.15	0.15	0.15	0.15	0.23	0.15	0.15
	P	0.6	0.33	0.6	0.33	0.6	0.25	0.6	0.33
	F1	0.75	0.33	0.75	0.33	0.75	0.29	0.75	0.33
G7	R	1	0.5	1	0.75	1	0.75	1	0.5
	F	0	0.17	0	0	0	0	0	0.17
	P	1	0.5	1	1	1	1	1	0.5
	F1	1	0.5	1	0.86	1	0.86	1	0.5
G8	R	1	0.75	1	0.75	1	0.75	1	0.75
	F	0	0	0	0	0	0	0	0
	P	1	1	1	1	1	1	1	1
	F1	1	0.86	1	0.86	1	0.86	1	0.86

Table A7

Classification results for three-node (M = 3) networks with 50,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G1	R	-	-	-	-	-	-	-	-
	F	0	0.11	0	0	0	0.11	0	0.22
	P	-	0	-	-	-	0	-	0
	F1	-	-	-	-	-	-	-	-
G2	R	1	0.5	1	0.5	1	0.5	1	0.5
	F	0	0.14	0	0.14	0	0.14	0	0.14
	P	1	0.5	1	0.5	1	0.5	1	0.5
	F1	1	0.5	1	0.5	1	0.5	1	0.5
G3	R	1	1	1	0.5	1	1	1	1
	F	0	0.14	0	0.14	0	0.14	0	0
	P	1	0.67	1	0.5	1	0.67	1	1
	F1	1	0.8	1	0.5	1	0.8	1	1
G4	R	1	0.5	1	1	1	0.5	1	1
	F	0	0.14	0	0	0	0.14	0	0
	P	1	0.5	1	1	1	0.5	1	1
	F1	1	0.5	1	1	1	0.5	1	1

Table A8

Classification results for four-node (M = 4) networks with 50,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G5	R	-	-	-	-	-	-	-	-
	F	0.19	0.062	0.19	0.19	0.19	0.12	0.19	0.12
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G6	R	1	0.33	1	0	1	0.33	1	0.33
	F	0	0.15	0	0	0	0.23	0.15	0.15
	P	1	0.33	1	-	1	0.25	0.6	0.33
	F1	1	0.33	1	-	1	0.29	0.75	0.33
G7	R	1	0.75	1	0.5	1	0.5	1	0.75
	F	0	0	0	0.17	0	0.083	0	0
	P	1	1	1	0.5	1	0.67	1	1
	F1	1	0.86	1	0.5	1	0.57	1	0.86
G8	R	1	0.75	1	0.75	1	0.75	1	0.75
	F	0	0	0	0	0	0	0	0
	P	1	1	1	1	1	1	1	1
	F1	1	0.86	1	0.86	1	0.86	1	0.86

Table A9

Classification results for three-node (M = 3) networks with 100,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G1	R	-	-	-	-	-	-	-	-
	F	0	0.22	0	0.11	0	0.22	0	0.11
	P	-	0	-	0	-	0	-	0
	F1	-	-	-	-	-	-	-	-
G2	R	1	0.5	1	1	1	1	1	1
	F	0	0.14	0	0	0	0	0	0.14
	P	1	0.5	1	1	1	1	1	0.67
	F1	1	0.5	1	1	1	1	1	0.8
G3	R	1	1	1	1	1	1	1	1
	F	0	0	0	0	0	0	0	0
	P	1	1	1	1	1	1	1	1
	F1	1	1	1	1	1	1	1	1
G4	R	1	1	1	1	1	1	1	1
	F	0	0	0	0	0	0	0	0
	P	1	1	1	1	1	1	1	1
	F1	1	1	1	1	1	1	1	1

Table A10

Classification results for four-node (M = 4) networks with 100,000 samples. We present the precision (P), recall (R), fallout (F), and F1-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

Graph	p-Value	∞		0.01		0.001		0.0001
	σψ	1	10	1	10	1	10	1	10
G5	R	-	-	-	-	-	-	-	-
	F	0.19	0.062	0.19	0.062	0.19	0.19	0.19	0.12
	P	0	0	0	0	0	0	0	0
	F1	-	-	-	-	-	-	-	-
G6	R	1	0.33	1	0.67	1	0.33	1	0.33
	F	0	0.15	0	0.15	0	0.077	0	0.15
	P	1	0.33	1	0.5	1	0.5	1	0.33
	F1	1	0.33	1	0.57	1	0.4	1	0.33
G7	R	1	-	1	-	1	-	1	-
	F	0	-	0	-	0	-	0	-
	P	1	-	1	-	1	-	1	-
	F1	1	-	1	-	1	-	1	-
G8	R	1	0.75	1	0.75	1	0.5	1	0.75
	F	0	0	0	0	0	0.083	0	0
	P	1	1	1	1	1	0.67	1	1
	F1	1	0.86	1	0.86	1	0.57	1	0.86
