
An Information Theoretic Approach to Symbolic Learning in Synthetic Languages.

Andrew D. Back, Janet Wiles.

Abstract

An important aspect of using entropy-based models and proposed "synthetic languages" is the seemingly simple task of knowing how to identify the probabilistic symbols. If the system has discrete features, then this task may be trivial; however, for observed analog behaviors described by continuous values, this raises the question of how we should determine such symbols. This task of symbolization extends the concept of scalar and vector quantization to consider explicit linguistic properties. Unlike previous quantization algorithms where the aim is primarily data compression and fidelity, the goal in this case is to produce a symbolic output sequence which incorporates some linguistic properties and hence is useful in forming language-based models. Hence, in this paper, we present methods for symbolization which take into account such properties in the form of probabilistic constraints. In particular, we propose new symbolization algorithms which constrain the symbols to have a Zipf-Mandelbrot-Li distribution which approximates the behavior of language elements. We introduce a novel constrained EM algorithm which is shown to effectively learn to produce symbols which approximate a Zipfian distribution. We demonstrate the efficacy of the proposed approaches on some examples using real world data in different tasks, including the translation of animal behavior into a possible human-understandable language equivalent.


Keywords:  Zipf–Mandelbrot–Li law; behavior prediction; entropy; information theoretic models; language models; synthetic language

Year:  2022        PMID: 35205553      PMCID: PMC8871184          DOI: 10.3390/e24020259

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Language is the primary way in which humans function intelligently in the world. Without language, it is almost inconceivable that we as a species could survive. Language is clearly far richer than mere words on a page or even spoken words. Almost every observable dynamic system in the world can be considered as having its own language of “words” that carry meaning. Put simply, we propose that “every system is language”. In contrast to classical value-based models such as those employed in signal processing, or even the concept of quantized models employing discrete values such as those found in classifiers, we propose that the next phase of AI systems may be based on the concept of synthetic languages. Such languages, we suggest, will provide the basis for AI to learn to interact with the world in varying levels of complexity, but without the technical challenges of infinite expressions [1] and building natural language processing systems based on human language. The concept of synthetic languages is that it is possible to derive a model to capture meaning which is either emergent or assigned in natural systems. While this approach can be understood for animal communications, we suggest that it may be useful in a wide range of other domains. For example, instead of modeling systems using hard classifications based on some measured features, a synthetic language approach could be useful for developing an understanding of meaning using behavioral models based on sequences of probabilistic events. These events might be captured as simple language elements. This approach of synthetic language stems from earlier work we have done based on entropy-based models. Normally, entropy requires a large number of samples to estimate accurately [2]. However, we have previously developed an efficient algorithm which permits entropy to be estimated with a small number of samples, which has enabled the development of simple entropy-based behavioral models.
For example, an early application has shown promising results, successfully detecting dementia in a control group of patients by listening to their conversation using a simple synthetic language consisting of 10 “synthetic words” derived from the inter-speech pause lengths [3]. The basis for this approach is the view that probabilistically framed behavioral events derived from dynamical systems may be viewed as words within a synthetic language. In contrast to human languages, we hypothesize the existence of synthetic languages defined by small alphabet sizes, limited vocabulary, reduced linguistic complexity and simplified meaning. Yet, how can a systematic approach be developed which determines such synthetic words? Usually the aim of human speech recognition is to form a probabilistic model which enables short term bursts of speech audio to be classified as particular words. However, with the approach being proposed here, we might consider meta-languages where other speech elements, perhaps sighs, coughs, emotive elements, coded speech, specific articulations or almost any behavioral phenomena can be considered as a synthetic language. The challenge is that we are seeking to discover language elements from some input signals. While we can know the ground truth of human language elements and hence determine the veracity of any particular symbolization algorithm, this task is not straightforward for synthetic languages where we do not have access to a ground truth dataset. Hence, while much of computational natural language processing relies significantly on modeling complex probabilistic interactions between language elements, resulting in models such as hidden Markov models and smoothing algorithms to capture the richness and complexities of human language, in our synthetic language approach the aims are considerably more modest. However, even with the significantly reduced complexity, we have found significant potential.
Consider one of the most basic questions of language: what are the language elements? Language can be viewed as observing one or more discrete random variables X_i of a sequence X = (X_1, X_2, ..., X_n); that is, each X_i may take on one of M distinct values from an alphabet A = {a_1, ..., a_M}, a set from which the members of the sequence are drawn, and hence is in this sense symbolic, where each value a_j occurs with probability p_j = P(X_i = a_j). One of the earliest methods of characterizing the probabilistic properties of symbolic sequences was proposed by Shannon [4,5], who proposed the concept of entropy, which can be defined as H(X) = −∑_{j=1..M} p_j log p_j. Entropy methods have been applied to a wide range of applications, including language description [6,7,8], recognition tasks [9,10,11,12,13], identification of disease markers through human gene mapping [14,15], phylogenetic diversity measurement [16], population biology [17] and drug discovery [18]. An important task in applying entropy methods and subsequently in deriving synthetic languages is the seemingly simple task of knowing what the probabilistic events are. If the system has discrete features, then this task may be trivial; however, for observed behaviors described by continuous values, this raises the question of how we should determine what the symbols are. For example, suppose we wish to convert the movement of a human body into a synthetic language. How can the movements, gestures or even speech be converted into appropriate symbols? This symbolization task is related to the well known problem of scalar quantization [19,20] and vector quantization [21,22], where the idea is to map a sequence of continuous or discrete values to a symbolic digital sequence for the purpose of digital communications. Usually the aim of this approach is data compression so that the minimum storage or bandwidth is required to transmit a given message within some fidelity constraints. Within this context, a range of algorithms have been derived to provide quantization properties.
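As a concrete illustration of the definition above, the entropy of a symbolic sequence can be estimated with a simple plug-in (frequency count) estimator. This is a minimal sketch in Python using only the standard library; the function name is ours, and this naive estimator is not the efficient small-sample algorithm the paper refers to in [2,33].

```python
from collections import Counter
from math import log2

def sequence_entropy(symbols):
    """Plug-in Shannon entropy H(X) = -sum_j p_j log2 p_j of a symbol sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A uniform 4-symbol sequence attains the maximum log2(4) = 2 bits.
print(sequence_entropy("abcd" * 10))  # → 2.0
```

Any hashable symbols (characters, integers, tuples) can be passed in, which is convenient for the behavioral symbol streams discussed later.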
Earlier work using vector quantization has been proposed in conjunction with speech recognition and speaker identification. A method of speaker identification was proposed using a k-means algorithm to extract speech features, after which vector quantization was applied to the extracted features [23]. An example of learning a vector quantized latent representation with phonemic output from human speech was demonstrated in [24]. The concept of combining vector quantization with extraction of speaker-invariant linguistic features appears in applications of voice conversion, where the aim is to convert the voice of a source speaker to that of a target speaker without altering the linguistic content [25]. A typical approach here is to perform spectral conversion between speakers, such as using a Gaussian mixture model (GMM) of the joint probability density of source and target features, or by identifying a phonemic model to isolate some linguistic features and then perform a conversion, while minimizing some error metric such as the frame by frame minimum mean square error. A common aspect of these models is generally to isolate some linguistic features and then apply vector quantization. Extensions of this work include learning vector quantization [26], with more recent extensions to deep learning [27], federated learning [28], and entropy-based constraints [29]. A neural autoencoder incorporating vector quantization was demonstrated to learn phonemic representations [30]. A method for improving compression of deep features using an entropy-optimized loss function for vector quantization and entropy coding modules to jointly minimize the total coding cost was proposed in [31]. In the work we present here, our interest is rather different in its goals from vector quantization, and hence the algorithms we derive take a different direction.
In particular, while vector quantization approaches tend to be aimed at efficient communications with minimum bit rates, we seek to discover algorithms which might uncover emergent language primitives. For example, given a stream of continuous values, it is possible to find a set of language elements which might be understood as letters or words. While it may be expected that such language based coding representations will also provide a degree of compression efficiency, this is not necessarily our primary goal. The goal of symbolization can therefore be differentiated from quantization in that the properties of determining language primitives may be very different from simply efficient data compression or even fidelity of reconstruction. For example, these properties may include metrics of robustness, intelligibility, identifiability, and learnability. In other words, the properties, goals and functional aspects of language elements are considerably more complex in nature than those used in vector quantization methods. In this paper, we propose an information theoretic approach for learning the symbols or letters within a synthetic language without any prior information. We demonstrate the efficacy of the proposed approaches on some examples using real world data in quite different tasks including the translation of the movement of a biological agent into a potential human language equivalent.

2. Synthetic Language Symbols

2.1. Aspects of Symbolization

Consider a sequence of continuous or discrete valued inputs. How can we obtain a corresponding sequence of symbols derived from this input? An M-level quantizer is a function Q(x) which maps the input x into one of a range of M values. A quantizer is said to be optimum in the Lloyd-Max sense if it minimizes the average distortion for a fixed number of levels M [19,32]. A symbolization algorithm converts any continuous-valued random variable input to a discrete random variable X_i of a sequence X = (X_1, ..., X_n), where X_i ∈ A. In this sense, each X_i can be viewed as a component in an alphabet, a finite nonempty set A = {a_1, ..., a_M} with symbolic members and dimensionality M. Hence, a trivial example of symbolization is M-level quantization by direct partitioning of the input space. In this case, for some continuous random variable x, we have an output X = a_j whenever x falls within the j-th partition, where the quantization levels are bounded as b_{j−1} < x ≤ b_j and where the boundaries b_j are chosen according to any desired strategy, for example with b_0 = −∞ and b_M = ∞. This approach partitions the input space such that the greater the value of the input, the higher the symbolic code in the alphabet, and the partitions can be distributed according to a cumulative normal distribution. This simple probabilistic algorithm encompasses the full input space and has been demonstrated to provide useful results. However, while any arbitrary symbolization scheme can be applied, in the context of extracting language elements it is reasonable to derive methods with preservation properties. Previously we have derived entropy estimation algorithms which include linguistic properties such as orthographic constraints with a Zipf–Mandelbrot–Li distribution [33]. Here we consider symbolization using a similar concept of preserving linguistic properties. In the first instance, we consider a symbolization method which introduces a Zipf–Mandelbrot–Li constraint, ensuring the derived language symbols reflect Zipfian properties. We then introduce a constraint which seeks to preserve defined language intelligibility properties.
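The trivial cumulative-normal partitioning described above can be sketched as follows. This is an illustrative Python implementation under our own naming, using only the standard library: the normal CDF is inverted by bisection so that each of the M partitions carries equal probability mass under a Gaussian input, and larger inputs receive higher symbolic codes.

```python
import math

def normal_cdf_boundaries(M, mu=0.0, sigma=1.0):
    """Boundaries b_1..b_{M-1} so each of the M cells has probability 1/M
    under a normal(mu, sigma) input distribution."""
    def inv_cdf(p):
        # Invert the normal CDF (via math.erf) by bisection.
        lo, hi = mu - 10 * sigma, mu + 10 * sigma
        for _ in range(100):
            mid = (lo + hi) / 2
            if 0.5 * (1 + math.erf((mid - mu) / (sigma * math.sqrt(2)))) < p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    return [inv_cdf(j / M) for j in range(1, M)]

def symbolize(x, boundaries):
    """Map input value x to a symbol index: higher inputs get higher codes."""
    return sum(1 for b in boundaries if x > b)

b = normal_cdf_boundaries(4)
print([symbolize(x, b) for x in (-2.0, -0.3, 0.3, 2.0)])  # → [0, 1, 2, 3]
```

The boundaries for M = 4 land at the quartiles of the standard normal (about ±0.674 and 0), so the full input space is covered, as the text describes.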
This approach can be further extended to consider properties such as preservation of prediction. Another approach we propose is to systematically partition the input space according to linguistic probabilistic principles so that the derived language approximates natural language. A method of doing this is to partition the input space such that each region has an associated probability which approximates the expected natural language events. The methods we present here can be compared to prior work such as universal algorithms for classification and prediction [34]. In contrast to these prior approaches, however, there are some major differences. In particular, since our interest is in language, the sequences we consider are non-ergodic [35,36], whereas universal classification methods are usually associated with ergodic processes [37]. Beyond being technically different, this means language symbolization requires very different approaches from those developed for ergodic universal classifiers. Our aim in this paper is to present and explore some of these different approaches, which we consider further below.

2.2. Zipf–Mandelbrot–Li Symbolization

Given the task of symbolization, we seek to constrain the output symbols to properties which might reflect some of the linguistic structure of the observed sequence. There are clearly many ways in which this can be achieved, and in this algorithm we propose a simple approach of probabilistic constraint as a first step in this direction. For natural language, described in terms of a sequence of probabilistic events, it has been shown that Zipf’s law [38,39,40,41,42,43] describes the probability of information events that can generally be ranked into monotonically decreasing order. This raises the question of whether it is possible to derive a symbolization algorithm which produces symbols that follow a Zipfian law. The idea here is that since we are seeking an algorithm which will symbolize the input to reflect language characteristics, then it is plausible that such an algorithm will produce symbols which will approximate a Zipfian distribution. In this section, we propose such a symbolization method. We have previously proposed a new variation of the Zipf–Mandelbrot–Li (ZML) law [2,39,44], which models the probability of a language element as a function of its frequency rank r, given an alphabet of size M. An advantage of this model is that it provides a discrete analytic form with reasonable accuracy given only the rank and the alphabet size. Moreover, it has the further advantage that it can be extended to include linguistic properties to improve the accuracy when compared against actual language data [33]. Hence, a symbolization process based on the ZML law can be obtained, and a derivation for this symbolization algorithm based on the original development in [39,45,46] is shown in Appendix A.
Using this approach, where a precise value of the ranked probability is available for each rank given the alphabet size M, then given a partitioned input space, the algorithm first enables the ordering of each partition in terms of its probabilistic rank such that each partition corresponds to the closest ZML probability. We then derive an algorithm termed CDF-ZML to constrain the partition probabilities with probabilistic characteristics approximating a ZML distribution. An example of the CDF-ZML approach with the resulting partitioning is shown in Figure 1, where it can be observed that the most probable symbol occurs initially and is then followed by successively smaller probabilities (indicated by the area under the curve).
Figure 1

The proposed CDF-ZML linguistic symbolization algorithm partitions the input space to provide symbols which follow a Zipfian distribution. In this one dimensional example, the symbols with highest rank are left-most and are followed by successively smaller probabilities as indicated by the area under the curve. The symbols are defined by the occurrence of a continuous valued input within a given partition. Hence each symbol has a corresponding probability defined by a Zipf–Mandelbrot–Li distribution.

In contrast to simple M-level quantization schemes, this approach to partitioning is not designed to optimize some metric of data compression; rather, it is a nonlinear partitioning of the input space which ensures the probability distribution of the data will follow an approximately Zipfian law. Thus, it may be viewed as a first step in seeking to symbolize an input sequence as language elements. Note that we cannot claim that the elements do indeed form the basis for actual language elements, so this is to be regarded as a first, but useful, step. In human language terms, this could potentially be useful, for example, in the segmentation of audio speech signals into phonemic symbols [47]. Hence, it may be similarly useful in applications of synthetic language to extract realistic language elements. Clearly, this approach can be further extended to consider other language-based symbolization methods. In the next section we consider another methodology which examines a more sophisticated property of language and how it may be incorporated into symbolization processes.
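A rough sketch of the CDF-ZML idea can be given as follows. Since the paper's exact ZML form is derived in Appendix A (not reproduced here), we substitute a generic Zipf-Mandelbrot expression p_r ∝ (r + β)^(−α) with illustrative parameters α = 1 and β = 2.7; the recoverable essence of the method is the placement of partition boundaries at empirical quantiles of the cumulative target probabilities, so that partition r captures roughly the probability mass p_r. All names and parameter values here are our own assumptions.

```python
import random

def zml_probabilities(M, alpha=1.0, beta=2.7):
    """Generic Zipf-Mandelbrot form p_r ∝ (r + beta)^(-alpha), r = 1..M."""
    w = [(r + beta) ** -alpha for r in range(1, M + 1)]
    s = sum(w)
    return [v / s for v in w]

def cdf_zml_boundaries(data, M, alpha=1.0, beta=2.7):
    """Boundaries at empirical quantiles of the cumulative target
    probabilities, so partition r captures roughly probability mass p_r."""
    xs = sorted(data)
    cum, bounds = 0.0, []
    for p in zml_probabilities(M, alpha, beta)[:-1]:
        cum += p
        bounds.append(xs[min(int(cum * len(xs)), len(xs) - 1)])
    return bounds

random.seed(0)
data = [random.gauss(0, 1) for _ in range(10000)]
bounds = cdf_zml_boundaries(data, M=5)
counts = [0] * 5
for x in data:
    counts[sum(1 for b in bounds if x > b)] += 1
print([c / len(data) for c in counts])  # roughly matches zml_probabilities(5)
```

With this construction the first partition is the most probable symbol and the remaining partitions follow with successively smaller probabilities, mirroring the behavior shown in Figure 1.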

2.3. Maximum Intelligibility Symbolization

Intelligibility is an important communications property which has received considerable interest [48,49,50]. Hence, a symbolization method which can potentially enhance intelligibility may be useful for deriving language based symbols. Are there any precedents of naturally occurring intelligibility maximization? In other words, since we are interested in synthetic languages which are not restricted to human language, are there other examples of naturally occurring forms of language or information transmission systems which seek to enhance a measure of intelligibility as a function of event probabilities? In fact, there are such examples in natural biological processes. Intelligibility optimization is observed in genetic recombination events during meiosis, where interference, which reduces the probability of recombination events, is a function of chromosomal distance [51]. In this case, genes closer together encounter greater interference and hence undergo fewer crossing over events, while for genes that are far apart, crossover and non-crossover events occur with equal frequency. Another way to view this is that the greatest intelligibility of genetic crossing over occurs with distant genes. Accordingly, in this section, we develop a symbolization algorithm which seeks to optimize intelligibility. Note that there are various metrics for defining intelligibility, so the approach we propose here is not intended to be optimal in the widest sense possible; other approaches are undoubtedly possible and may yield better results or be better suited to particular tasks or languages. Based on the concept that probabilistically similar events could be confused with each other, especially in the presence of noise, intelligibility can be defined as a function of the distance between probabilistically similar events.
In this case, the idea is that the symbols which are closest in probabilistic ranking should be made as distant as possible in the input space. Hence, we can introduce a probability-weighted constraint to be maximized, in which a weighting function enables a continuously variable weighting to be applied to the probabilistic lexicon of symbolic events as a function of the distance indices between symbols. This approach can be generalized in various ways to include optimization constraints for features such as robustness, alphabet size or variance of the symbolic input range. Hence, if we require a symbolization scheme which will provide improved performance in the presence of noise, then it may be useful to consider constraints which take this into account. Here we propose a symbolization approach using a probabilistic divergence measure as an intelligibility metric. The goal is that the symbolization algorithm will produce a sequence of symbols which have a property of maximum intelligibility. The way we do this is to ensure that nearby symbols are the least likely to occur together. This is somewhat similar to the principle used in the typewriter QWERTY keyboard, where the idea was to place keys together which are unlikely to be used in succession [52]. The proposed MaxIntel algorithm is aimed at producing a symbolization scheme which arranges nearby symbol partitions, according to the value of the input, to maximize probabilistic intelligibility. The derivation of the symbolization algorithm with maximum intelligibility constraints is given in Appendix B. A diagram of this intelligibility CDF indexing is shown in Figure 2.
Figure 2

A natural extension of the symbolization process is to optimize the intelligibility of the observed sequence. The rank ordering process of the proposed intelligibility symbolization algorithm is illustrated here.

The performance of the proposed intelligibility constrained symbolization algorithm is shown in Figure 3. In this case, partitioning is constrained to maximize a probabilistic divergence measure between sequential unigrams which optimizes the probabilistically framed intelligibility.
Figure 3

The performance of the proposed maximum intelligibility symbolization algorithm is shown in this example. The lower red curve shows the intelligibility of regular symbolization, while the upper green curve shows the intelligibility of the proposed algorithm as a function of probabilistic symbol rank. In this case the alphabet size is M = 20.
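The rank-separation principle behind MaxIntel (the QWERTY analogy above) can be illustrated with a simple stride permutation: consecutive probabilistic ranks are assigned to partition positions roughly half the alphabet apart. This is our own illustrative stand-in, not the divergence-optimizing algorithm of Appendix B.

```python
from math import gcd

def maxintel_order(M):
    """Assign rank r (0-based) to partition position (r * stride) mod M,
    with stride near M/2 and coprime to M, so consecutively ranked symbols
    land roughly half the alphabet apart in input space."""
    stride = M // 2
    while gcd(stride, M) != 1:
        stride += 1
    return [(r * stride) % M for r in range(M)]

order = maxintel_order(20)
# Circular separation between consecutively ranked symbols.
seps = [min(abs(a - b), 20 - abs(a - b)) for a, b in zip(order, order[1:])]
print(min(seps))  # → 9
```

For M = 20 (the alphabet size of Figure 3), the stride becomes 11, so every pair of consecutive ranks is 9 positions apart, rather than adjacent as in regular symbolization.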

3. Learning Synthetic Language Symbols

A Linguistic Constrained EM Symbolization Algorithm (LCEM)

In contrast to the constructive symbolization algorithms considered in the previous section, here we propose an adaptive neural-style learning symbolization algorithm suitable for symbolizing dynamical systems where the data are provided sequentially over time. In particular, we present an adaptive algorithm which constrains the symbols to have a linguistically framed probability distribution. The approach we propose is a symbolization method based on the well known Expectation-Maximization (EM) algorithm [53]. However, a problem with regular EM symbolization is that no consideration is given to language constraints which might provide a more realistic symbolization with potential advantages. Hence, we derive a probabilistically constrained EM algorithm which seeks to provide synthetic primitives conforming to a Zipf–Mandelbrot–Li distribution corresponding to the expected distribution properties of a language alphabet. The derivation of the Linguistic Constrained Expectation-Maximization (LCEM) algorithm is given in Appendix C. The LCEM algorithm operates by initializing a set of clusters and then progressively minimizing an entropic error as a probabilistic constraint by adapting the mean and variance of each cluster. In this manner, a new weighting for each cluster is obtained which is a function of the likelihood and the entropic error. This continues until the cluster probabilities converge, as indicated by the entropy, or until some other criterion has been met, such as time to converge. The convergence performance of the LCEM symbolization algorithm is shown in Figure 4. In this case, the probabilistic surface of the clusters is shown during the adaptive learning process. The symbol probabilities can be observed to converge to the ZML probability surface.
Figure 4

The performance of the LCEM adaptive linguistic symbolization algorithm is shown. In this case, the probabilistic surface of the clusters is shown as it adapts. After 30 steps, the cluster probabilities (meshed surface) can be observed to converge to the ZML probability surface (shaded).
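A heavily simplified sketch of the LCEM idea is given below for the one-dimensional Gaussian mixture case. In place of the entropic-error constraint derived in Appendix C, we use a soft blend of the maximum-likelihood mixture weights toward a ZML-like target at each M-step; the blending factor eta, the ZML parameters, and all names are our own illustrative assumptions, not the paper's algorithm.

```python
import random, math

def zml(M, beta=2.7):
    """Illustrative ZML-like target: p_r ∝ 1 / (r + beta), normalized."""
    w = [1.0 / (r + beta) for r in range(1, M + 1)]
    s = sum(w)
    return [v / s for v in w]

def lcem(data, M, steps=30, eta=0.5, seed=0):
    """Toy 1-D EM for a Gaussian mixture whose weights are nudged toward
    a ZML target each M-step (a simplified stand-in for the entropic
    constraint of the paper's LCEM algorithm)."""
    rng = random.Random(seed)
    mu = sorted(rng.choice(data) for _ in range(M))
    var = [1.0] * M
    pi = [1.0 / M] * M
    target = zml(M)
    for _ in range(steps):
        # E-step: responsibilities of each cluster for each point.
        resp = []
        for x in data:
            g = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(M)]
            s = sum(g) or 1e-300
            resp.append([v / s for v in g])
        # M-step, with a soft pull of the weights toward the ZML target.
        for k in range(M):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-3)
            pi[k] = (1 - eta) * (nk / len(data)) + eta * target[k]
    return pi, mu, var

random.seed(1)
data = ([random.gauss(0, 0.5) for _ in range(600)]
        + [random.gauss(4, 0.5) for _ in range(300)]
        + [random.gauss(8, 0.5) for _ in range(100)])
pi, mu, var = lcem(data, M=3)
```

After 30 steps the weights settle between the raw data proportions and the ZML target, loosely mimicking the convergence toward the ZML surface shown in Figure 4.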

4. Example Results

4.1. Authorship Classification

In this section, we present some examples demonstrating the application of symbolization. It should be noted that our intention is not to validate symbolization using simulations; rather, we simply present some potential applications which show that useful results can be obtained. We leave it to the reader to explore the potential of the presented methods further. The examples also show that symbolization by itself does not necessarily solve a task, but it can be an important part of the overall approach in discovering potential meaning when applied to real world systems. Hence, we can expect that symbolization will be only part of a much more comprehensive model. In this example, we pose the problem of detecting changing authorship of a novel without any pretraining. This is not intended to be a difficult challenge; however, it is included to demonstrate the concept of using symbols within an entropy-based model to determine some characteristics of a system based on a sequence of input symbols. For the purpose of the example, the text was filtered to remove all punctuation and extra white space. The text was then symbolized based on word length. In this case, the initial dataset consisted of the text from “The Adventures of Sherlock Holmes”, which is a collection of twelve short stories by Arthur Conan Doyle, published in 1892. This main text was interspersed with short segments from a classic children’s story, “Green Eggs and Ham” by Dr. Seuss, published in 1960. The full texts were symbolized and the entropy was estimated using the efficient algorithm described in [33], applied to non-overlapping windows of 500 symbols in length. The way in which authors are detected is then based on measuring the short-term entropy of the input text features. We do not necessarily know in advance what the characteristics of the text will be; hence, simplified methods of measuring average word lengths are not so helpful.
Hence, in this case, by assigning a symbol to different word length ranges, the idea is that the entropy will characterize the probabilistic distribution of the input features. To detect the different authors, we can introduce a simple classification applied to the entropy measurement. In this case, we use the standard deviation of entropy, but for a higher dimensional example, a more sophisticated classification scheme could be used, for example, a k-means classifier. The results are shown in Figure 5, where the different authors are clearly identifiable in each instance by a significant drop in entropy when the interspersed author is encountered. Clearly, this simple demonstration could be extended to multiple features using more complex classifiers; however, we do not do this here.
Figure 5

The proposed symbolization algorithm used in conjunction with a fast entropy estimation model readily detects different authors as shown here. An entire novel is represented across the horizontal axis.
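The word-length symbolization and windowed-entropy pipeline used in this example can be sketched as follows. This is a minimal stand-in in Python: the alphabet cap M, the window length, and the k-standard-deviation outlier rule are our illustrative choices, and the entropy estimator is the naive plug-in rather than the efficient algorithm of [33].

```python
from collections import Counter
from math import log2
import statistics

def word_length_symbols(text, M=5):
    """Symbolize text by word length; lengths >= M share the top symbol."""
    return [min(len(w), M) - 1 for w in text.split()]

def windowed_entropy(symbols, window=500):
    """Plug-in entropy over non-overlapping windows of the symbol stream."""
    out = []
    for i in range(0, len(symbols) - window + 1, window):
        c = Counter(symbols[i:i + window])
        out.append(-sum((v / window) * log2(v / window) for v in c.values()))
    return out

def flag_outliers(entropies, k=2.0):
    """Flag windows whose entropy deviates by more than k standard deviations,
    a simple stand-in for the standard-deviation classification in the text."""
    mu = statistics.mean(entropies)
    sd = statistics.stdev(entropies)
    return [i for i, h in enumerate(entropies) if abs(h - mu) > k * sd]
```

Running the three functions in sequence on a cleaned text flags the windows where the word-length distribution, and hence the entropy, shifts, which is where the interspersed author would be detected.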

4.2. Symbol Learning Using an LCEM Algorithm

The behavior of the proposed LCEM algorithm is considered in this section. A convenient application to examine is a finite mixture model where the cluster probabilities are constrained towards a ZML model. In this case, we consider a small synthetic alphabet with a set of M symbols. The performance of the LCEM algorithm when applied to a sample multivariate dataset is shown in Figure 6.
Figure 6

The entropy convergence behavior of the proposed linguistic constrained EM algorithm. Here an example is shown of how the LCEM algorithm performs on a sample multivariate data set.

Hence, the proposed LCEM algorithm is evidently successful in deriving a set of synthetic symbols from a multivariate dataset with unknown distributions by adapting a multivariate finite mixture model. It can be observed that the convergence performance of the proposed LCEM algorithm occurs within a small number of samples. Interestingly, since the optimization is based on the likelihood but constrained against the entropic error, the gradient surface is nonlinear, and hence we observe an irregular, non-smooth error curve. It is of interest to examine the convergence of the cluster probabilities and an example of the LCEM algorithm performance is shown in Figure 7 where the cluster probabilities are compared to the theoretical ZML distribution.
Figure 7

An example of the LCEM algorithm performance in adapting the cluster probabilities (solid line) as compared to the theoretical ZML distribution (dotted line) when near convergence.

4.3. Potential Translation of Animal Behavior into Human Language

This example is presented as a curious investigation into the potential applications of symbolization and synthetic language. Our interest is in discovering ways of modeling and understanding behavior using these methods, and so we certainly do not claim that this is a definitive method of translating animal behavior into a human language form. However, we found it somewhat interesting, even if speculative in the latter stage, and hence it is intended to stimulate discussion and ideas rather than provide a definitive solution to this task. In part, our application is motivated by the highly useful data collected by Chakravarty [54] from triaxial accelerometers attached to wild Kalahari meerkats. The raw time series data from one of the sensors over a time period of about 3 min are shown in Figure 8.
Figure 8

Behavioral time series data obtained from a triaxial accelerometer attached to a wild Kalahari meerkat recorded over about 3 min and sampled at 100 Hz.

Analysis of animal behavior has received considerable attention, and recent work based on an information theoretic approach includes an entropy analysis of the behavior of mice and monkeys using two types of behavior [55]. A range of information theoretic methods including relative entropy, mutual information and Kolmogorov complexity were used to analyze the movements of various animals using binned trajectory data in [56]. An analysis of observed speed distributions of Pacific bluefin tuna was conducted using relative entropy in [57]. The positional data between pairs of zebra fish were analyzed using transfer entropy to model their social interactions in [58]. It is evident that an information theoretic approach can yield a deeper understanding of animal behavior. A common aspect of these and other prior works that we are aware of is that they are restricted to entropy-based methods. In our case, we are considering the possibility of extending this approach further by treating the symbols as elements within a synthetic language. Hence, we are interested to raise the question of whether it is possible to obtain an understanding of biological or other behaviors using a synthetic language approach. To the best of our knowledge this approach has not been considered previously in this way. Hence, this will be demonstrative and exploratory in nature, rather than a definitive analysis. The first step in our example is to symbolize the observed data. In contrast to most entropy based techniques, a larger alphabet of symbols can be readily accommodated within the analysis; however, for the purpose of this case, we select a smaller number of symbols which ensures the input range is fully covered. The CDF-ZML symbolization algorithm with a synthetic alphabet of size M was applied to 10 s of meerkat behavioral data. A ZML distribution was generated according to (A1)–(A9). The symbolic pdf output is shown in Figure 9.
Figure 9

Applying the CDF-ZML symbolization algorithm to the meerkat behavioral data yields this symbolic pdf, which is then used to symbolize the data.
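As a minimal sketch of this symbolization step, the idea can be expressed as cutting the empirical CDF of the data at the cumulative target probabilities. This is an illustration under stated assumptions: a generic Zipf–Mandelbrot form p_k ∝ 1/(k+q)^s stands in for the paper's full ZML construction (A1)–(A9), the parameters and alphabet size are placeholders, and Gaussian noise stands in for the accelerometer signal.

```python
import random
from bisect import bisect_right

def zml_probs(n_symbols, s=1.0, q=2.7):
    # Generic Zipf-Mandelbrot weights p_k proportional to 1/(k+q)^s; the
    # paper's exact ZML parameterization (A1)-(A9) is approximated here.
    w = [1.0 / (k + q) ** s for k in range(1, n_symbols + 1)]
    z = sum(w)
    return [v / z for v in w]

def cdf_zml_symbolize(x, n_symbols=8, s=1.0, q=2.7):
    # Cut the empirical CDF of x at the cumulative ZML targets, so that
    # symbol k occurs with probability close to p_k.
    p = zml_probs(n_symbols, s, q)
    xs = sorted(x)
    cum, edges = 0.0, []
    for pk in p[:-1]:
        cum += pk
        edges.append(xs[min(int(cum * len(xs)), len(xs) - 1)])
    return [bisect_right(edges, v) for v in x]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for accelerometer samples
sym = cdf_zml_symbolize(x, 8)                      # symbol indices 0..7, 0 most frequent
```

By construction, lower symbol indices occur more often, so the empirical symbol pdf approximates the ZML target regardless of the distribution of the input data.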

Hence, the raw meerkat behavioral data are symbolized, where for convenience in this example, we use letters to represent each symbol. Note that we could also consider a richer set of input data as symbols, for example, n-grams, frequency-domain transformed inputs, or a combination of these. Observing a synthetic language requires determining letters, spaces and words. Hence, symbolization provides the first stage in discovering the equivalent of letters in the observed sequence. The next step is to determine spaces, which in turn enables the discovery of words. However, even this simple task is not necessarily trivial. In human language, it is generally found that the most frequent symbol is a space. Following a similar approach, we found it useful to determine which symbol or symbols separate words in the symbol sequence. In human speech or written language, a space or pause is simply represented by a period of silence. However, in behavioral dynamics, particularly when there is nearly constant movement, such a period of “silence” does not necessarily have a corresponding behavioral characteristic of no movement. In this context, we consider that the functional role of a space is essentially a “do-nothing” operation. Hence, in an animal species which displays almost constant movement, there are effectively six types of behaviors which we propose can constitute a functional space. These correspond to the forward and reverse directions in each of the three axes. The functional spaces are then identified by measuring which movements in these directions occur with the highest frequency. Note that this does not mean that the animal is actually doing nothing; it means that, from an information theoretic perspective, these symbols have the highest relative probability of occurring and therefore convey little “surprising” information.
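The space-finding heuristic described above can be sketched as follows. The symbol stream and the choice of a single space symbol are illustrative assumptions; the paper's setting would use the highest-frequency movement symbols per axis direction.

```python
from collections import Counter

def find_functional_spaces(symbols, n_spaces=1):
    # Treat the highest-frequency symbol(s) as functional "spaces",
    # following the observation that the most frequent symbol in human
    # language text is the space.
    ranked = [s for s, _ in Counter(symbols).most_common()]
    return set(ranked[:n_spaces])

def segment_words(symbols, spaces):
    # Split the symbol stream into synthetic-language "words" at spaces.
    words, current = [], []
    for s in symbols:
        if s in spaces:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(s)
    if current:
        words.append("".join(current))
    return words

seq = "AABAACADAABAA"                 # hypothetical symbolized stream
spaces = find_functional_spaces(seq)  # 'A' is the most frequent symbol
words = segment_words(seq, spaces)
```

Once the functional space is identified, the remaining runs of symbols become the candidate synthetic words used in the rest of the analysis.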
As a first step, the entropy of the resulting symbolic sequence can be computed and is shown in Figure 10. This reveals some structure in terms of low and high frequency periodic behavior. Such periodic probabilistic behavior is to be expected in natural languages since it may correspond to random, but predictably frequent words with different probabilities [59].
Figure 10

Entropy of the Kalahari meerkat movement when symbolized using the proposed CDF-ZML algorithm. The quasi-periodic time series reveals structure in terms of low and high frequency periodic behavior corresponding to the predictably frequent occurrence of words with different probabilities.
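A trace like the one in Figure 10 can be produced by computing entropy over a sliding window of the symbol sequence. This sketch uses the standard plug-in Shannon estimator rather than the authors' efficient small-sample estimator, and the window length and step are our assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    # Plug-in Shannon entropy (bits) of the symbol distribution in a window.
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def sliding_entropy(symbols, win=50, step=10):
    # Entropy over a sliding window; plotted against time this gives a
    # trace of low/high entropy periods in the symbolic sequence.
    return [shannon_entropy(symbols[i:i + win])
            for i in range(0, len(symbols) - win + 1, step)]

# A perfectly alternating toy sequence gives a flat trace of 1 bit per symbol.
trace = sliding_entropy("ABAB" * 100, win=40, step=20)
```

Quasi-periodic dips and peaks in such a trace correspond to stretches of more and less predictable symbols, which is the structure discussed above.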

A symbolic sequence obtained from the meerkat raw sensory data is shown in Figure 11. For convenience, the symbols are represented by letters and the synthetic language spaces are replaced by regular spaces, which enables the synthetic language words to be viewed.
Figure 11

Symbolic output from the Kalahari meerkat movement data obtained using CDF-ZML symbolization algorithm. The symbols are represented as letters and the synthetic words are clearly evident.

When viewing this sequence of symbols, perhaps the first question that can be asked is “how can this be understood?” The well-known direct method is one approach for determining the meaning of words within languages [60]. This method relies on a number of aspects; principally, it requires some form of direct matching between real world objects or tasks and the related words. Consequently, it has some disadvantages, including the difficulty of learning and the time required. Grammar translation is another approach for translating between languages and is based on knowledge of the rules, grammar and meaning of words in both languages [61]. However, for the task of learning the meaning of a new synthetic language for which we do not have an understanding of the language itself, this method is not likely to be useful. One possible approach which may be useful for translating synthetic language is to consider methods based on understanding the functional aspects of the language. The idea of communicative-functional translation is to view translation as communication between specific actors [62]. Hence, we suggest that a probabilistic functional translation method might be of interest in this case. How can such probabilistic functionality be measured and adopted in a potential translation application? It is clearly not possible to use the probabilistic structure across a large vocabulary, and universal structure is notoriously uncommon across languages [63]. However, we suggest that it may be possible to consider an approach of cross-lingual transfer learning based on the probabilistic structure of parts of speech (POS) [64,65]. Despite the existence of a large number of POS across various languages, and some disagreement about the definitions, there is strong evidence that a set of coarse POS lexical categories exists across all languages in one form or another [66].
This indicates that while fine-grained relative cross-lingual POS probabilities may vary, when linked to the same observed linguistic instantiations, the cross-lingual probabilities of coarse POS categories are likely to be similar [67,68,69]. Therefore, although we have used probabilistic POS rankings from an English-language corpus in this example, since only coarse-grained categories are used, the rankings might be expected to be stable across languages. The ranked probabilities of some POS categories are close to one another, which introduces some degree of uncertainty into the results. One approach to consider this further in future work would be to introduce functional grammatical structure based on probabilistic mappings. That is, in the present example we consider only very simple probabilistic rankings of coarse POS; however, a future approach might extend the concept of probabilistic rankings to more complex linguistic properties, as employed in current POS tagging systems, to disambiguate POS and reduce the uncertainty of the many possible English-language translations [70,71]. Hence, as a next step in this investigation, we determined the probabilistic characteristics of coarse-grained POS observed in English using the Brown Corpus. The normalized ranked probabilities are shown in Figure 12 and the specific types of speech are shown in Figure 13.
Figure 12

Normalized ranked probabilistic characteristics of parts of speech from the Brown Corpus. This is used as the first step in our proposed translation approach, which applies cross-lingual transfer learning to probabilistic parts of speech.

Figure 13

The specific parts of speech obtained from the experimentally determined ranked probabilities across a range of corpora.
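The normalization and ranking step behind Figure 12 amounts to the following. The tag counts here are illustrative placeholders, not the actual Brown Corpus tallies.

```python
def ranked_pos_probabilities(tag_counts):
    # Normalize raw coarse-POS tag counts into probabilities and sort
    # them in descending order of probability.
    total = sum(tag_counts.values())
    ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(tag, count / total) for tag, count in ranked]

# Hypothetical coarse-tag counts standing in for a Brown Corpus tally.
counts = {"NOUN": 275, "VERB": 182, "ADP": 144, "DET": 137,
          "ADJ": 83, "ADV": 56, "PRON": 49, "CONJ": 38}
ranking = ranked_pos_probabilities(counts)
```

The resulting ranked list is what the cross-lingual mapping in the next step consumes.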

The synthetic words can be ranked according to their probabilities, and because the vocabulary is limited, we map them to the corresponding parts of speech in human language with the same relative probabilistic rankings. This would not necessarily be easy for large vocabularies, but because the synthetic language under consideration has a small vocabulary, we can readily form the mapping as shown in Figure 14. This approach assumes that the same POS exist in both languages and correspond according to probabilistic ranking. However, this appears to be a reasonable assumption when considering sets of coarse syntactic POS categories which omit finer-grained lexical categories, as we do in this example [66,70,71,72].
Figure 14

The probabilistic rankings of the English language coarse syntactic parts of speech are used to map the synthetic language words with the same rankings.
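The rank-matching idea can be sketched as follows: the most frequent synthetic word is paired with the most probable POS category, the second with the second, and so on. The synthetic words and the POS ordering below are illustrative assumptions.

```python
from collections import Counter

def map_words_to_pos(synthetic_words, pos_ranking):
    # Pair synthetic words with coarse POS categories by matching their
    # relative probabilistic rankings (most frequent word -> most
    # probable POS, etc.).
    word_ranking = [w for w, _ in Counter(synthetic_words).most_common()]
    return dict(zip(word_ranking, pos_ranking))

words = ["AB", "BCD", "AB", "C", "AB", "BCD", "D"]  # hypothetical synthetic words
pos = ["NOUN", "VERB", "ADP", "DET"]                # coarse POS, descending probability
mapping = map_words_to_pos(words, pos)              # e.g. "AB" -> "NOUN"
```

Because the synthetic vocabulary is small, this one-to-one rank alignment is feasible; for larger vocabularies ties and near-equal probabilities would make it ambiguous, as noted above.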

While the cross-lingual POS mapping gives some possible insight into the potential meaning of the observed synthetic language, it is also of interest to view high probability words corresponding to the sequence of observed POS. Hence, we propose the concept of visualizing what the sequence would potentially “look like” if it were written in a human understandable form. The next step is therefore not intended to provide an accurate translation of the actual behavior, but rather a means of viewing one possible narrative which could fit the observations. Accordingly, we analyzed the Brown Corpus to obtain the most frequent words corresponding to each of the coarse syntactic POS categories. We then assigned each of these most frequent words to the categories as shown in Figure 15. We do not claim that these words are what is actually “spoken” by the behavior, but they provide a novel way of viewing a potential narrative for this example.
Figure 15

The synthetic language words can be mapped to potential parts of speech according to their relative probabilities. Then a potential human language translation can be formed by associating place-holder words with the synthetic language “words” corresponding to specific parts of speech.
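The final placeholder step composes the two mappings: each synthetic word goes to its POS category, and each POS category goes to a high-frequency word of that category. The mappings and placeholder words below are illustrative, not the actual Brown Corpus frequency leaders.

```python
def translate(synthetic_words, word_to_pos, pos_to_placeholder):
    # Render a speculative "translation": synthetic word -> coarse POS
    # category -> placeholder word for that category.
    return " ".join(pos_to_placeholder[word_to_pos[w]] for w in synthetic_words)

word_to_pos = {"AB": "NOUN", "BCD": "VERB", "C": "DET"}   # from the rank mapping
pos_to_placeholder = {"NOUN": "time", "VERB": "is", "DET": "the"}
sentence = translate(["C", "AB", "BCD"], word_to_pos, pos_to_placeholder)
```

The output is a readable, but entirely speculative, surface form of the symbolic sequence, which is all that this visualization step is intended to provide.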

In this way, instead of viewing the synthetic words such as “BCD”, “AB”, etc., we first map them to a coarse level syntactic POS. From this point, the POS are mapped to recognizable human language words. Although admittedly speculative, this last stage provides an interesting insight into how we might begin to understand the possible meaning of unknown synthetic language texts. The resulting output in human language is shown in Figure 16, where a simplistic, but recognizable synthetic dialog can be seen emerging.
Figure 16

The synthetic language of the meerkat behavior is first translated into coarse syntactic POS categories which omit finer-grained lexical categories. Then, using the most frequent probabilistically ranked POS words as place-holders, we can then provide a possible, but speculative narrative.

Interestingly, although not shown in the results here, the behavior of the meerkat can be viewed as doing nothing of interest for a period of time—almost like an extended period of “silence” and then followed by a period of high activity. It is in this high activity time that we show the output “translation” results. These last stage results are included as a curiosity. A next step from this point is to explore ways of ascertaining more reliable word translations. For example, embedding ground truth by direct learning is one possibility, though with distinct disadvantages as noted earlier. However, another more promising approach appears to be extending the concept of communicative functional grammar, which we can consider in terms of conditional probabilistic structures. However, these topics are beyond the scope of this paper and will be explored in the future.

5. Conclusions

Many real world systems can be modeled by symbolic sequences which can be analyzed in terms of entropy or synthetic language. Synthetic languages are defined in terms of symbolic primitives based on probabilistic behavioral events. These events can be viewed as symbolic primitives such as letters, words and spaces within a synthetic language characterized by small alphabet sizes, limited word lengths and vocabulary, and reduced linguistic complexity. The process of symbolization in the context of language extends the concept of scalar and vector quantization from data compression and fidelity to explicit linguistic properties such as probabilistic distributions and intelligibility. In contrast to human languages, where we know the language elements including letters, words, functional grammars and meaning, for synthetic languages even determining what constitutes a letter or word is not trivial. In this paper, we propose algorithms for symbolization which take such linguistic properties into account. We propose a symbolization algorithm which constrains the symbols to have a Zipf–Mandelbrot–Li distribution, which approximates the behavior of language elements and forms the basis of some linguistic properties. A significant property of language is effective communication across a medium with noise; hence, we propose a symbolization method which optimizes a measure of intelligibility. We further introduce a linguistically constrained EM algorithm which is shown to effectively learn to produce symbols which approximate a Zipfian distribution. We demonstrate the efficacy of the proposed approaches on examples using real world data in different tasks, including authorship classification and an application of the linguistically constrained EM algorithm. Finally, we consider a novel model of synthetic language translation based on communicative-functional translation with probabilistic syntactic parts of speech.
In this case, we analyzed behavioral data recorded from Kalahari meerkats using the symbolization methods proposed in the paper. Although not intended to provide an accurate or authentic translation, this example demonstrates a possible approach to translating unknown data, treated as synthetic language, into a human understandable narrative. The main contributions of this work are new symbolization algorithms which extend earlier quantization approaches to include linguistic properties. This is a necessary step for using entropy-based methods or synthetic language approaches when the input data are continuous and no apparent symbols exist in the measured data. We presented various examples of applying the symbolization algorithms to real world data which demonstrate how they may be effectively used to analyze such systems.