Literature DB >> 35486689

On the dimensionality of behavior.

Abstract

There is a growing effort in the “physics of behavior” that aims at complete quantitative characterization of animal movements under more complex, naturalistic conditions. One reaction to the resulting explosion of high-dimensional data is the search for low-dimensional structure. Here I try to define more clearly what we mean by the dimensionality of behavior, where observable behavior may consist of either continuous trajectories or sequences of discrete states. This discussion also serves to isolate situations in which the dimensionality of behavior is effectively infinite.

Entities: Chemical

Keywords: complexity; information; prediction

Mesh：

Year: 2022 PMID： 35486689 PMCID： PMC9170048 DOI： 10.1073/pnas.2021860119

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN： 0027-8424 Impact factor: 12.779

Observations on behavior provide a window into the dynamics of the brain and mind. This is an ancient idea, now receiving renewed attention because of the explosive growth of methods for quantitative measurements of behavior (1–8). These methods produce enormous quantities of raw data, such as high-resolution videos, so there is an obvious practical interest in data compression. This often involves searching for a low-dimensional description of the animal’s configuration or posture at each moment in time. This search is grounded both by the observation that even large and complex animals have relatively small numbers of muscles or joints and by direct evidence that motor behaviors are described by low-dimensional models in organisms from the worm Caenorhabditis elegans to humans and nonhuman primates (1, 9–14). Reducing great volumes of video data to time series for just a few degrees of freedom is a triumph. The fact that this now can be done more or less automatically with machine learning methods means that exhaustive and quantitative characterization of behavior is possible in a much wider range of organisms, under a wider range of conditions. But the classical literature on dynamical systems reminds us that the time series of even a single variable could encode a higher-dimensional structure (15, 16). Indeed, this seems natural: The brain that generates behavior has many degrees of freedom, and observations of behavior should be sensitive to these potentially high-dimensional dynamics. On the other hand, the dynamics of large neural networks might be confined to low-dimensional manifolds, perhaps to match the dimensionality of motor behaviors (17–20). All of these developments point to the problem of defining the dimensionality of behavior. In the extreme, we can imagine that the observable behavior reduces to a single function of time, as with the opening angle of a clamshell. Can we analyze this time series to identify a well-defined dimensionality for the underlying dynamics? Is it possible that this dimensionality is effectively infinite?

A Context for Phenomenology

The quantitative analysis of behavior, including what follows here, is unapologetically phenomenological. The question is not “How does the brain generate behavior?” but rather “What is it about behavior that we would like to explain?” In an era of highly mechanistic biology, this emphasis on phenomenological description may seem odd. So, at the risk of repeating things that are well known, it is useful to remind ourselves of the long historical context for this approach. If we want to explain why we look like our parents, a qualitative answer is that we carry copies of their DNA. But, if we want to understand the reliability with which traits pass from generation to generation, then DNA structure is not enough—the free energy differences between correct and incorrect base pairing are not sufficient to explain the reliability of molecular copying if the reactions are allowed to come to thermal equilibrium, and this problem arises not just in DNA replication but in every step of molecular information transmission. Cells achieve their observed reliability by holding these reactions away from equilibrium, allowing for proofreading or error correction (21, 22). In the absence of proofreading, the majority of proteins would contain at least one incorrect amino acid, and ∼10% of our genes would be different from those carried by either parent; these error rates are orders of magnitude larger than observed. These quantitative differences are so large that life without proofreading would be qualitatively different.* The example of proofreading highlights the importance of starting with a quantitative characterization of the phenomena we are trying to explain. For brains and behavior, this is an old idea. In the late nineteenth century, many people were trying to turn observations on seeing and hearing into quantitative experiments, creating a subject that would come to be called psychophysics (23). By ∼1910, these experiments were sufficiently well developed that Lorentz could look at data on the “minimum visible” and suggest that the retina is capable of counting single photons (24), and Rayleigh could identify the conflict between our ability to localize low-frequency sounds and the conventional wisdom that we are “phase deaf” (25). Both of these essentially theoretical observations, grounded in quantitative descriptions of human behavior, drove experimental efforts that unfolded over more than half a century. Also ∼1910, von Frisch (26) was doing psychophysics experiments to demonstrate bees could, in fact, discriminate among the beautiful colors of the flowers that they pollinate. But he took these experiments in a very different direction, focusing not on the discrete choices made by individual bees but on how these individuals communicated their sensory experiences to other residents of the hive, leading to the discovery of the “dance language” of bees. What grew out of the work by von Frisch and others was ethology (28), which emphasizes the richness of behavior in its natural context, the context in which it was selected for by evolution. Because ethologists wrestle with complex behaviors, they often resort to verbal description. In contrast, psychophysicists focus on situations in which subjects are constrained to a small number of discrete alternative behaviors, so it is natural to give a quantitative description by estimating the probabilities of different choices under various conditions. The emergence of a quantitative language for the analysis of psychophysical experiments was aided by the focus on constrained behaviors, but was not an automatic consequence of this focus. For photon counting in vision, the underlying physics suggests how the probability of seeing vs. not seeing will depend on light intensity (29), but the observation that human observers behave as predicted points to profound facts about the underlying mechanisms (30). Attempts to formalize the problems of detection led to a more general view of the choices among discrete alternative behaviors being discriminations among signals in a background of noise (31), and, in the 1950s and 1960s, this view was exported to experimental psychology (23). Much of this now seems like an exercise in probability and statistics, something obviously correct, but the early literature records considerable skepticism about whether this (or perhaps any) mathematization of human behavior would succeed. More generally, quantitative phenomenology has been foundational, certainly in physics and also in the mainstream of biology. Mendel’s genetics was a phenomenological description of the patterns of inheritance, and the realization that genes are arranged linearly along chromosomes came from a more refined quantitative analysis of these same patterns (32). The work of Hodgkin and Huxley (33) led to our modern understanding of electrical activity in terms of ion channel dynamics, but explicitly eschewed mechanistic claims in favor of phenomenology. The idea that transmission across a synapse depends on transmitter molecules packaged into vesicles emerged from the quantitative analysis of voltage fluctuations at the neuromuscular junction (34). Even when we are searching for microscopic mechanisms, it is not anachronistic to explore macroscopic descriptions. Time and again, the scientific community has leaned on phenomenology to imagine the underlying mechanisms, often taking literally the individual terms in a mathematical description as representing the actual microscopic elements for which we should be searching, whether these are genes, ion channels, synaptic vesicles, or quarks (35–37). What is anachronistic, in the literal sense of the word, is to believe that microscopic mechanisms were discovered by direct microscopic observations without guidance from phenomenology on a larger scale. In this broad context, how can we construct a quantitative phenomenology of complex, naturalistic behaviors? When we do psychophysics, we characterize behaviors with numbers that are meaningfully comparable across situations and across species. To give but one example, we can discuss the accumulation of evidence for decisions that humans and nonhuman primates make based on visual inputs, but we can use the same mathematical language to discuss decisions made by rodents based on auditory inputs (38). A quantitative characterization of naturalistic behaviors requires, similarly, that we attach comparable numbers to very different kinds of time series. The dimensionality of behavior is a candidate for this sort of unifying mathematical language.

Two Examples

To work toward a sharper definition, consider the case in which the behavior we observe is just a single function of time, x(t). Two examples of such trajectories are shown in Fig. 1, Left, and we will see that these correspond to one-dimensional (blue) and two-dimensional (red) systems. Qualitatively, the blue trajectory varies on one characteristic time scale, while the red trajectory involves rough movements on a short time scale superposed on smoother movements over a longer time scale. Our task is to make these observations precise.

Fig. 1.

Two examples of behavioral trajectories (Left) and their correlation functions (Right). One-dimensional example, from Eq. , is shown in blue. Two-dimensional example, from Eq. , is shown in red. Behavioral trajectories are offset for clarity, and time is measured in units of the correlation time τ. Let’s work backward and start with a model for the behavior, a model in which it seems clear that the behavior really is one dimensional: The observed behavioral trajectory x(t) is described completely bywhere is white noise, It is important that the noise source is white; nonwhite noise sources, which themselves are correlated over time, are equivalent to having hidden degrees of freedom that carry these correlations. The blue trajectory in Fig. 1, Left is drawn from a simulation of Eq. with . The observable consequences of the dynamics in Eqs. and . are well known: x(t) will be a Gaussian stochastic process, with the two-point correlation functionshown in Fig. 1, Right. We recall that, for a Gaussian process, once we specify the two-point function, there is nothing else to say about the system. Importantly, we can turn this around: If the observed behavior is a Gaussian stochastic process, and the correlations decay exponentially as in Eq. , then Eqs. and are a complete description of the dynamics. An example of a clearly two-dimensional system involves not only the observable x(t) but also an internal variable y(t), To keep things simple, we can assume that the driving noises are independent of one another, and, again, they should be white so that we are not hiding additional variables that carry correlations. Since y is hidden, its units are arbitrary, which allows us to have the strength of the noise driving each variable be the same without loss of generality, so that The choice to give each variable the same correlation time is just for illustration, as is the symmetry of the dynamical matrix; the red trajectory in Fig. 1, Left is drawn from a simulation of Eq. with and . Again, x(t) is Gaussian, but now the correlation function has two exponential decays,shown in Fig. 1, Right. The short time scale corresponds to the rough movements seen in the trajectory, while corresponds to the smoother movements. We see that a one-dimensional system generates behavior with a correlation function that has one exponential decay, while a two-dimensional system generates a correlation function with two exponential decays. We would like to turn this around, and say that, if we observe a certain structure in the behavioral correlations, then we can infer the underlying dimensionality.

Gaussian Processes More Generally

Analyzing behavioral trajectories by constructing explicit dynamical equations, as in Eq. or , may not be the best approach. In particular, if there are hidden dimensions, then there is no preferred coordinate system in the space of unmeasured variables, and hence no unique form for the dynamical equations. Let us think, instead, about the probability distribution of the observed trajectories x(t). For Gaussian processes, this has the formwhere the integrals run over the interval of our observations, which should be long. The kernel is inverse to the correlation function, We can divide the full trajectory x(t) into the past, , with , and the future, , with t > 0. Schematically,where couples the past and future. More explicitly, If is of finite rank, so thatthen everything that we can predict about future behavior given knowledge of past behavior is captured by D features, Eq. is telling us that the features provide “sufficient statistics” for making predictions. We recall that, in a dynamical system with D variables,predicting the future (t > 0) requires specifying D initial conditions (at t = 0). In this precise sense, the number of variables that we need to achieve maximum predictive power is the dimensionality of the dynamical system. To complete the argument, we need to show that has finite rank when correlations decay as a finite combination of exponentials; see . In the case of Gaussian stochastic processes, we thus arrive at a recipe for defining the dimensionality of the underlying dynamics. We estimate the correlation function, take its inverse to find the kernel, and isolate the part of this kernel which couples past and future. If this past–future kernel is of finite rank, then we can identify this rank with the dimensionality of the system. In Fig. 2, Top we see a sample trajectory (in red) from a system that is, by construction, three dimensional, with correlation times , and the discrete time step () of our observations. As explained in , the coefficients in Eq. can be found as the eigenvalues of a symmetric matrix, and these eigenvalues are plotted in Fig. 2, Bottom in rank order (in red). This numerical analysis yields three clearly nonzero eigenvalues, with other eigenvalues below . Importantly, we find essentially the same three eigenvalues when the analysis is done in time windows of very different sizes—here and , smaller and larger, respectively, than the longest correlation time.

Fig. 2.

Sample trajectories (Top) and spectra of the matrix (Bottom). In red is an example in which the underlying dynamics is three dimensional. In blue is an example with power law correlations, as in Eq. with , which is effectively infinite dimensional. Time is measured in discrete steps , and spectra are computed in windows of duration (×) or (). Details are provided in . The past–future coupling is not guaranteed to be of finite rank. More generally, if we analyze signals in a window of size T, then the rank can grow with T. This happens, for example, if behavioral correlations decay as a power of time, Fig. 2, Top shows a sample trajectory from a Gaussian process with this correlation function (in blue), and Fig. 2, Bottom shows the associated spectrum of coefficients for (in blue). This illustrates both that there is no obvious cutoff to the spectrum and that the spectrum extends farther when the analysis is done in longer time windows [ (×) vs. ()]. Indeed, the larger the window, the farther the spectrum extends, with no bound. Under these conditions, the dimensionality is effectively infinite. The possibility that behavioral correlations decay as a power of time has a long and sometimes contentious history. It thus is worth noting that scaling of the correlation function implies an effectively infinite dimensionality, but it is not required. We can imagine situations in which the kernel has an arbitrarily large number of nonzero eigenvalues in the limit of long observation times even if the correlation is not precisely a power law. While the relation of dimensionality to the spectrum of is attractive, estimating this spectrum from finite data can be challenging. Even if the true spectrum has only a finite number of nonzero eigenvalues, in matrices built from finite samples of data, the zero eigenvalues will be replaced by a continuous spectrum, and this could make it difficult, in practice, to distinguish finite from infinite dimensional processes. At the same time, it is important to emphasize that difficulty in resolving eigenvalues against a continuum generated by finite sample size is not evidence for low dimensionality, nor should a continuum be assigned as noise without further analysis. Random matrix theory provides quantitative predictions for spectral broadening in closely related contexts, including the dependence of spectra on sample size and matrix dimensionality, and these should provide a basis for identifying the contributions of noise to the observed eigenvalue spectra (39).

Discrete States

In many cases, it is natural to describe animal behavior as moving through a sequence of discrete states. We do this, for example, when we transcribe human speech to text, and when we describe a bacterium as running or tumbling (40). This identification of discrete states is not just an arbitrary quantization of continuous motor outputs, nor should it be a qualitative judgment by human observers. Discrete states should correspond to distinguishable clusters, or resolvable peaks in the distribution over the natural continuous variables, and the dynamics should consist of movements in the neighborhood of one peak that are punctuated by relatively rapid jumps to another peak (e.g., ref. 3). A “mechanism” for such discreteness is the existence of multiple dynamical attractors, with jumps driven by noise (e.g., refs. 1 and 13). When behavioral states are discrete, how do we define dimensionality? Once again, it is useful to think about the simplest case, where there are just two behavioral states—perhaps “doing something” and “doing nothing”—and time is marked by discrete ticks of a clock. We can represent the two states at each time t by an Ising variable . If the sequence of behavioral states were Markovian, then σ depends only on , and, because , the only possible stationary probability distribution for the sequences iswhich is the one-dimensional Ising model with nearest-neighbor interactions. Importantly, if we measure the correlations of the fluctuations in behavioral state around its mean,we find that these correlations decay exponentially,where we can express τ in terms of h and J (41). This reminds us of the exponential decays in the continuous case with Gaussian fluctuations. Suppose that we have only two states, but observe correlations that do not decay as a single exponential. Then the probability distribution must have terms that describe explicit dependences of σ on with . This can be true only if there are some hidden states or variables that carry memory across the temporal gap . A sensible definition for the dimensionality of behavior then refers to these internal variables. Imagine that we observe the mean of the behavioral variable, , and the correlation function . What can we say about the probability distribution ? There are infinitely many models that are consistent with measurements of just these (two-point) correlations, but there is one that stands out as having the minimal structure required to match these observations (42). Said another way, there is a unique model that predicts the observed correlations but otherwise generates behavioral sequences that are as random as possible. This minimally structured model is the one that has the largest possible entropy, and it has the formwhere the parameter h must be adjusted so that the model predicts the observed mean behavior , and the function must be adjusted so that the model predicts the observed correlation function . Maximum entropy models have a long history, and a deep connection to statistical mechanics (42). As applied to temporal sequences, the maximum entropy models sometimes are referred to as maximum caliber (43). For biological systems, there has been interest in the use of maximum entropy methods to describe amino acid sequence variation in protein families (44–46), patterns of electrical activity in populations of neurons (47–51), velocity fluctuations in flocks of birds (52, 53), and more. There have been more limited attempts to use these ideas in describing temporal sequences, in neural populations (54) and in flocks (55–57). To connect with the previous discussion, for continuous variables, a Gaussian process is the maximum entropy model consistent with the measured (two-point) correlations. In particular, if correlations decay as a combination of exponentials, then, in discrete time, the relevant Gaussian model has maximum entropy consistent with correlations among a finite number of neighboring time points. These models can also be written as autoregressive processes (58). The maximum entropy model in Eq. can be rewritten exactly as a model in which the behavioral state at time t depends only on some internal variable x(t). As explained in , x(t) is not Gaussian, but the only coupling of past and future, again, is through a kernel K(t). This kernel is not the inverse of the observed behavioral correlations but of the effective interactions between states at different times, . But, importantly, we are considering quantities that are determined by the correlation function, and hence the problem is conceptually similar to the Gaussian case: We analyze the correlations to derive a kernel, and the dimensionality of behavior is the rank of this kernel. The maximum entropy model plays a useful role because it is the least structured model consistent with the observed correlations. If x(t) is one dimensional in the sense defined above, then the interactions decay over some fixed time scale, , and, at long times, the correlations also will decay exponentially. At the opposite extreme, if x(t) has effectively infinite dimensionality, then we can have . Ising models with such power-law interactions are the subject of a large literature in statistical physics; the richest behaviors are at α = 2, where results presaged major developments in the renormalization group and topological phase transitions (59–62). It would be fascinating if these models emerged as effective descriptions of strongly non-Markovian sequences in animal behavior, as suggested recently (63).

Generalization

In both the continuous Gaussian case and the discrete case, dimensionality can be measured through the problem of prediction. To make this more general, consider observations of behavior in a time window ; for simplicity, I will keep the notation x(t) for the behavioral trajectory. Within each window, the trajectory defines the past defines the future , and these are drawn from the joint probability distribution . To characterize the possibility of making predictions, we can measure the mutual information between past and future, This “predictive information” can have very different qualitative behaviors as T becomes large (64). For a time series that can be captured by a finite-state Markov process, or more generally described by a finite correlation time, then is finite as . On the other hand, for Gaussian processes with correlation functions that decay as a power, as in Eq. , the predictive information diverges logarithmically, , and similarly for discrete time series with power-law correlations. In the example of a dynamical system with D variables, as in Eq. , all the predictive power available will be realized if we can specify D numbers, which are the initial conditions for integrating the differential equations. Thus we consider smooth mappings of the past into d features, For any choice of features, we can compute how much predictive information has been captured, and then we can maximize over the mapping, resulting inwhich is the maximum predictive information we can capture with d features in windows of duration T. If the system truly is D dimensional, then D features of the past are sufficient to capture all of the available predictive information. This means that a plot of vs. d will saturate. To be precise, we are interested in what happens at large T, so we can define If , then we can write the analog of Eq. ,where the features now are more complex functions of the past. But there are only D of these features needed to make Eq. true (), and so we conclude that the behavior has dimensionality D. The equivalence of Eq. to Eq. immediately tells us that the general information theoretic definition of dimensionality agrees with the definition for Gaussian processes based on the spectrum of . In the Gaussian case, we see that the features are just linearly filtered versions of the past, as in Eq. . The connection to the discussion of two-state variables is a bit more complicated, and exploits the equivalence to an internal or latent variable as described in .

Conclusion

The arguments here define the dimensionality of behavior as the minimum number of features of the past needed to make the maximally informative predictions about the future. As we consider pasts of longer duration, the dimensionality can grow, potentially without bound. The connection between dimensionality and prediction is familiar from the now classical literature on dynamical systems, which also reminds us that, in its most general form, any such definition runs into all the well-known difficulties of estimating dimensions from finite data (65). More useful is the result that, in some cases, estimating this predictive dimensionality reduces to analyzing the spectrum of a matrix.

40 in total

On the dimensionality of behavior.

A Context for Phenomenology

Two Examples

Gaussian Processes More Generally

Discrete States

Generalization

Conclusion

1. Human arm movements described by a low-dimensional superposition of principal components.

2. Kinetic amplification of enzyme discrimination.

3. Mapping Sub-Second Structure in Mouse Behavior.

4. Identification of direct residue contacts in protein-protein interaction by message passing.

5. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning.

6. Fast animal pose estimation using deep neural networks.

7. Searching for collective behavior in a large network of sensory neurons.

8. Collective Behavior of Place and Non-place Neurons in the Hippocampal Network.

9. Thermodynamics and signatures of criticality in a network of neurons.

10. Local equilibrium in bird flocks.

Review 1. The Promise of Behavioral Tracking Systems for Advancing Primate Animal Welfare.