
Brauer and partition diagram models for phylogenetic trees and forests.

Abstract

We introduce a correspondence between phylogenetic trees and Brauer diagrams, inspired by links between binary trees and matchings described by Diaconis and Holmes (1998 Proc. Natl Acad. Sci. USA 95, 14 600-14 602. (doi:10.1073/pnas.95.25.14600)). This correspondence gives rise to a range of semigroup structures on the set of phylogenetic trees, and opens the prospect of many applications. We furthermore extend the Diaconis-Holmes correspondence from binary trees to non-binary trees and to forests, showing for instance that the set of all forests is in bijection with the set of partitions of finite sets.


Keywords: Brauer diagram; partition monoid; phylogenetics; sandwich semigroup

Year: 2022 PMID: 35702594 PMCID: PMC9185836 DOI: 10.1098/rspa.2022.0044

Source DB: PubMed Journal: Proc Math Phys Eng Sci ISSN: 1364-5021 Impact factor: 3.213

Introduction

Phylogenetic trees are a fundamental and persistent idea used to represent evolutionary relationships between species for nearly two centuries. Their use extends beyond biology to the representation of language evolution, and to decision processes in algorithms. Their appeal is that they provide an extra dimension to the ways to relate the elements of a set beyond a linear order. They have been studied directly in numerous ways, through stochastic processes, combinatorics and geometry (e.g. [1,2]). An indirect but powerful way to study tree structures is to consider correspondences, or ways to represent trees, within other mathematical objects. This is a standard approach in much of mathematics of course (representation theory is defined by this idea). And there are several known bijections between rooted binary phylogenetic trees and other structures. For instance, trees correspond to certain polynomials [3,4], to perfect matchings [5], and to more general partitions of finite sets [6]. The latter has been extended to classes of forests relevant to phylogenetics via a correspondence between forests and trees, providing a way to enumerate such forests using Stirling numbers [7]. Other frameworks for capturing the combinatorics of tree counting and tree shapes have been through the analysis of binary sequences [8], via symmetric function theory [9], and numerous others (see for example the OEIS listing A001147 [10]). The particular connection with perfect matchings has prompted the suggestion by Diaconis & Holmes [5] that there may be a relation to certain diagrams that have independently arisen within several branches of algebra, notably those given by the Brauer algebra [11]. In this paper, we take up and develop the link to Brauer algebras, and develop correspondences between the related diagram structures and phylogenetic trees, that provide an algebraic framework for their study. 
Each labelled, rooted binary tree will correspond to a unique element in the Brauer category, whose elements can be represented by asymmetrical Brauer diagrams. More generally, a corresponding structure for non-binary trees can be defined using the associated partition category (for partition categories see for instance [12,13]), and we are able to extend this correspondence to the set of all forests. The extension to forests is associated with a new bijective correspondence between forests and partitions of finite sets that directly extends the ideas of Erdős & Székely [6] in a direction distinct from that in Erdős [7]. The Brauer diagram framework has the potential to reveal more structure within the space of tree shapes (sometimes called topologies), and to provide new methods to randomly move about that space (something important for many tree reconstruction algorithms). In particular, the inherited structure provides numerous ways to break down tree space: for instance, through the use of Green’s relations from semigroup theory, we can break the set of phylogenetic trees into $\mathcal{L}$-classes, $\mathcal{R}$-classes, $\mathcal{H}$-classes and $\mathcal{D}$-classes, each of which can be interpreted in the light of standard properties of trees (see §4 for definitions of these semigroup classes). A concomitant of this is the possibility to craft new metrics on tree space. Finally, we are able to define an operation on tree-space, effectively a multiplication of trees, relative to a fixed tree, that turns the space into a semigroup, using the ‘sandwich semigroup’ product. It is then possible to consider the ‘regular’ elements of this semigroup, that provide a subsemigroup of the set of trees with respect to the chosen fixed tree. The paper begins in §2 with some background to the algebraic structures we will be using to describe tree space, namely the Brauer algebra and related monoids. 
We also recall key concepts from semigroup theory, especially Green’s relations and the corresponding equivalence classes into which a semigroup decomposes. We then describe (§3) the matching result of Diaconis & Holmes [5], and show how it extends to a correspondence with Brauer diagrams. Section 4 then transfers known results on the structure of the Brauer category across to phylogenetic trees, and discusses the rich new structure for tree space that becomes available. Section 5 introduces a product on trees, relative to a given tree, derived from the definition of the sandwich semigroup on Brauer diagrams, and explores various algebraic consequences of this powerful concept, including for example the characterization of related entities such as the associated regular subsemigroup of trees. Finally, §6 sketches a further broad generalization of the correspondence, between non-binary trees and the associated partition diagrams and partition category. A key result is a correspondence between trees and forests and sets of partitions of finite sets (theorem 6.5). We end with a discussion of directions that this algebraic landscape for phylogenetic trees might open up for further research (§7).

Background on phylogenetic trees, matchings and semigroups

For a set $X$ of cardinality $n$, a rooted (binary) phylogenetic $X$-tree is a graph with $n$ valence 1 leaf vertices labelled by $X$, together with $n-2$ unlabelled, valence three, internal vertices, and an additional unlabelled valence 2 root vertex. Note that we do not consider the root to be an internal vertex. We will usually assume without loss of generality that $X = \{1, \ldots, n\}$. If $v$ is a vertex of a tree $T$, the children of $v$ are the vertices that are the targets of edges whose source is $v$, and we say $v$ is the parent of each of its children. The cardinality of the set of such leaf-labelled binary trees is $(2n-3)!! = 1 \cdot 3 \cdots (2n-3)$. This and other tree-related counting problems are well studied [6,8,14] (see also [9,15,16]). In the latter part of this paper, we will be working with trees that are not necessarily binary. These satisfy the same properties as the binary $X$-trees, except that the internal vertices are not necessarily of valence 3 (for which we use the term non-binary trees[1]): instead they have in-degree 1, and out-degree at least 2. We also consider the set of all rooted phylogenetic $X$-trees, binary or not. In some contexts, we also permit $n = 1$, in which case the ‘trivial’ tree is an isolated leaf labelled by the element of $X$, and which is also the root. Note, the trivial tree is not an element of either of these sets of trees, which are restricted to trees on $n \geq 2$ leaves. An $X$-forest is a set of trees whose leaf sets partition $X$. The components of a forest are the individual trees that make it up. The trivial forest on $X$ is the one in which all component trees are trivial; that is, a set of isolated leaves labelled by the elements of $X$. A perfect matching on a set is a partition of the set into pairs, and the number of perfect matchings on $m$ elements ($m$ even) is $(m-1)!!$. This gives rise to a natural correspondence between rooted phylogenetic trees on $n$ leaves, and perfect matchings, on $2n-2$ elements (in this case, the leaf set $\{1, \ldots, n\}$, augmented by an additional $n-2$ elements, $n+1, \ldots, 2n-2$). 
This correspondence was formalized as a bijective map by Diaconis & Holmes [5], crystallizing that in Erdős & Székely [6], and is described as follows. Associating the leaves of the tree with the first $n$ nodes of the perfect matching, the initial set of pairings among all elements is simply generated from the cherries of the tree (pairs of leaves which share an immediate common ancestor), which are at most $\lfloor n/2 \rfloor$ in number. The next unmatched node, $n+1$, is assigned to the internal tree vertex, both of whose child vertices are already labelled, one of which has the numerically lowest label. The next lowest available node label for any unmatched internal tree vertex or vertices is in turn assigned, among those with the already labelled child vertices, to the one containing the numerically lowest child. The process repeats until the node $2n-2$, corresponding to the last unlabelled internal tree vertex, is finally identified with its partner. An example, with a chord diagram representation (see below) of the node matchings, used in [5], is given in figure 4. This algorithm is formalized in pseudocode in algorithm 1.
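The labelling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a transcription of algorithm 1: the nested-tuple encoding of trees and the names `leaf_labels`, `internal_subtrees` and `tree_to_blocks` are assumptions made for this example.

```python
def leaf_labels(tree):
    """Leaves are ints; internal vertices are tuples of subtrees."""
    if isinstance(tree, int):
        return [tree]
    return [x for child in tree for x in leaf_labels(child)]


def internal_subtrees(tree):
    """All non-leaf subtrees strictly below the root."""
    found = []

    def visit(t, is_root):
        if isinstance(t, int):
            return
        if not is_root:
            found.append(t)
        for child in t:
            visit(child, False)

    visit(tree, True)
    return found


def tree_to_blocks(tree):
    """Label internal vertices n+1, n+2, ... and return the sibling blocks."""
    leaves = leaf_labels(tree)
    label = {leaf: leaf for leaf in leaves}
    pending = internal_subtrees(tree)
    next_label = len(leaves) + 1
    while pending:
        # Vertices whose children are all labelled already.
        ready = [t for t in pending if all(c in label for c in t)]
        # Among those, take the one owning the lowest-labelled child.
        chosen = min(ready, key=lambda t: min(label[c] for c in t))
        label[chosen] = next_label
        next_label += 1
        pending.remove(chosen)

    blocks = []

    def collect(t):
        if isinstance(t, int):
            return
        blocks.append(tuple(sorted(label[c] for c in t)))
        for child in t:
            collect(child)

    collect(tree)
    return sorted(blocks)


caterpillar = (((1, 2), 3), 4)
print(tree_to_blocks(caterpillar))  # [(1, 2), (3, 5), (4, 6)]
```

For a binary tree every block is a pair, so the output is a perfect matching on the leaf labels together with the labels assigned to internal vertices; the root itself receives no label, and its children form the last pair.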

Figure 4

The six-leaf tree from figure 3, and its corresponding chord diagram representation.

The correspondence with perfect matchings gives rise naturally to a different family of diagrams, namely those in the Brauer algebra , which (unsurprisingly) has the same dimension, and the partial Brauer monoid . In the next section, we introduce these algebraic structures, and describe the correspondence. A (set) partition of a finite non-empty set is a set of pairwise disjoint subsets of , whose union is . If is a set partition of , write for the number of subsets in , and write . An (integer) partition of a positive integer is a multiset of positive integers whose sum is . In this paper, we will use ‘partition’ to refer to set partition unless otherwise noted. A semigroup is a set with an associative operation. A simple example is the set of positive integers with the operation of addition. If the semigroup has an identity element with respect to its operation, it is called a monoid. An example is the non-negative integers , in which the identity with respect to addition is of course 0. Later in the paper, we will introduce the semigroup notions of Green’s relations and ‘eggbox diagrams’ that represent these relations. For an introduction to some of these notions from semigroup theory, we recommend [17,18].

Connections between trees, the Brauer algebra and Brauer diagrams

In the first part of this paper, we develop the correspondence between binary phylogenetic trees and perfect matchings via a further diagrammatic setting, that was already referred to in [5]: namely, by exploiting diagrams linked to the Brauer algebra. These will share a basic structure and fundamental elements (the generators of the algebra), and so we briefly introduce the Brauer algebra itself, before moving to the specific set of diagrams that will be our focus.

Introduction to Brauer diagrams

For each the Brauer algebra is the complex algebra generated by the elements [11] subject to the defining relations given in appendix A. For the purposes of links to phylogenetics, we will make two significant restrictions to this generality. First, we will restrict to , and focus on (although the potential to extend the ideas we discuss here to the full Brauer algebra makes an interesting question that we leave to future work). Second, we will treat and related objects (see below) as monoids (or partial monoids), working with just the basis elements of the algebra. These have a standard diagram transcription, in which elements are associated with graphs consisting of nodes: ‘upper’ nodes and ‘lower’ nodes with edges specified as set partitions of consisting of pairs of nodes. In particular, contains the pairs with all remaining nodes , sequentially paired, while contains , with all remaining nodes sequentially paired (figure 1).

Figure 1

Diagrammatic representation of the generators and , for .

The (associative) product is formed by diagram concatenation, with edges joined via identification of the lower nodes of the first (upper) multiplicand, and the upper nodes of the second (lower) multiplicand. An example is shown in figure 2.
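Diagram concatenation can be illustrated with a short Python sketch. The encoding is an assumption made here: a node is a pair such as `("t", 2)` (top row, position 2) or `("b", 2)` (bottom row), and a diagram is a collection of two-node edges. Closed loops formed in the middle row are simply dropped, which is also an assumption and corresponds to working with basis diagrams only. The helper names `straight`, `s`, `e` and `as_set` are hypothetical, and the two generators follow the standard description of a crossing and a cup-cap pair at positions i and i+1.

```python
def pairing(edges):
    """Map each node to its partner; edges are 2-element collections of nodes."""
    partner = {}
    for a, b in edges:
        partner[a], partner[b] = b, a
    return partner


def compose(upper, lower):
    """Stack `upper` on `lower`, gluing upper's bottom row to lower's top row."""
    up, low = pairing(upper), pairing(lower)
    outer = [v for v in up if v[0] == "t"] + [v for v in low if v[0] == "b"]
    result, seen = set(), set()
    for start in outer:
        if start in seen:
            continue
        node, in_upper = start, start[0] == "t"
        while True:
            node = (up if in_upper else low)[node]
            if node[0] == ("t" if in_upper else "b"):
                break  # the path has come back out to an outer row
            # Otherwise we are on the shared middle row: step into the other diagram.
            node = ("t" if in_upper else "b", node[1])
            in_upper = not in_upper
        seen.update((start, node))
        result.add(frozenset((start, node)))
    return result


def straight(n, skip=()):
    """Vertical edges at every position not listed in `skip`."""
    return [(("t", j), ("b", j)) for j in range(1, n + 1) if j not in skip]


def s(i, n):
    """Crossing of strands i and i+1."""
    return straight(n, (i, i + 1)) + [(("t", i), ("b", i + 1)), (("t", i + 1), ("b", i))]


def e(i, n):
    """Cup on the top row and cap on the bottom row at positions i, i+1."""
    return straight(n, (i, i + 1)) + [(("t", i), ("t", i + 1)), (("b", i), ("b", i + 1))]


def as_set(edges):
    return {frozenset(edge) for edge in edges}


# A crossing composed with itself gives the identity diagram.
print(compose(s(1, 3), s(1, 3)) == as_set(straight(3)))  # True
```

Because `compose` returns a set of edges that `pairing` can read back in, products of more than two diagrams can be formed by nesting calls.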

Figure 2

Multiplication of Brauer diagrams in .

Binary phylogenetic -trees and their correspondence with Brauer diagrams

Given a binary tree on leaves, once its internal vertices are labelled according to algorithm 1, we immediately have a matching on the set , obtained by pairing labels on sibling vertices [5,6]. The reverse direction—constructing a tree from a matching—begins with laying out the leaf vertices and then pairing matched vertices that are already there, in increasing order. Informally, the relation to matchings can be seen as the result of the growth of a binary tree via successive bifurcations: starting from a simple tree with two leaves (a single ‘cherry’), with matching , one or the other edge suffers a bifurcation, giving for example (a matching of ), whereupon 2, the parent (still paired to its original partner) becomes an internal node, with the children a new pair of nodes. In this way, by iteration, a correspondence is built with matchings, or indeed Brauer diagrams, provided a unique convention for internal node enumeration is given (tantamount simply to having ‘unlabelled’ internal nodes), as has been described above following [5]. The root location is inferred from the last matched pairing. Through the bijection between binary trees and matchings described in §2, structural features of the Brauer monoid are induced on trees. Each binary tree on leaves uniquely defines a matching on , and each such matching uniquely defines a Brauer diagram in by connecting nodes labelled by integers paired in the matching (see example in figure 3). Write for the Brauer diagram corresponding to .

Figure 3

The six-leaf tree on the left corresponds to the matching , giving the Brauer diagram shown on the right.

If the product of two elements is written using juxtaposition, then there is an induced product on trees, defined by Similarly, the -involution in , defined on pairings by interchanging , for all , or on diagrams by reflection about the horizontal axis, induces an involution on trees, For example, the tree in figure 3 satisfies , because its diagram is symmetric about the horizontal axis. The bijection from trees of course extends to a variety of representations of the Brauer monoid or combinatorial objects tied thereto. From the perspective of matchings of an even set, for example, it is more natural to label elements as rather than marking them as (as already done in establishing bijection ), and a diagram correspondence could simply be established via arcs between a linear arrangement of nodes . A less biased arrangement is a chord diagram, with arcs linking an even number of nodes arranged in a circle, as in [5]. Relative to a fixed ordering, such a diagram represents a set of transpositions, an element of the symmetric group . Thus, the chord diagram in figure 4 corresponding to the six-leaf tree of figure 3, is obtained by bending up the ends of the lower rail of nodes of the corresponding Brauer diagram, so that they join on to the upper rail of nodes . Associating a labelled tree with an element of the symmetric group in this way, as a product of transpositions coming from the matching, confers yet another possible multiplicative structure (exploited in the transposition distance for trees [19]). For manipulations on trees, in this paper, we will use a modified bijection, not between and , but between and , one of the equivalent diagrammatic presentations of matchings. 
In practice, the original and the modified bijection (for which we use the same symbol) are the same algorithmically, with the difference that the nodes on the upper edge correspond to the leaves of the -leaf trees, and the nodes on the lower edge are those reserved for the corresponding internal tree nodes (as mentioned, with the root location being inferred from the last pairing). It is also convenient to continue the numbering from top to bottom in clockwise fashion. See figure 5 for the Brauer transcription in of the six-leaf tree whose presentations in , and as an element of (via a chord diagram), have been given in figures 3 and 4, respectively.

Figure 5

The six-leaf tree from figure 3, and its corresponding Brauer element . For this diagram, we have , , and and rank.

With these conventions, the setting of trees in as a semigroup, afforded by the transcription to , is supplanted by the categorical setting of partial monoids [20-22], where a multiplication on exists only if (compare [23]). In practice, we analyse , and hence , via left- and right-multiplication by , , respectively, and the associated equivalence classes. Moreover, via multiplication with the help of intermediate ‘sandwich’ elements belonging to , a semigroup structure can indeed be reimposed (see §5). This turns out to admit a rather universal description, allowing further structural aspects among the participating trees to be distinguished. We now briefly introduce some language for describing features of a diagram , that echoes that for functions, as follows. A block of a diagram is a connected set of nodes in . A transversal is an edge that passes between the top row and the bottom row. The domain of a diagram , , is the set of points along the top of the transversals, a subset of . The rank of is . The codomain, , is the set of points along the bottom of the transversals. The kernel and cokernel are defined slightly differently, for technical reasons which will become apparent later and Note that if is a block in with , then both and are in . An example is given in figure 5. We will occasionally want to refer to the underlying set of the kernel or cokernel. By this we mean the set of all points in or respectively that appear in relations in the sets. That is, for a set of binary relations , . With these definitions, we note the following properties that hold for such a diagram , that will be used in the sequel and noting that , and also the numerical relations and
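The notions of transversal, domain, codomain and rank can be made concrete with a small Python sketch. It assumes a diagram for an n-leaf tree is stored as a list of integer pairs, with labels 1 to n on the top row (leaves) and larger labels on the bottom row (internal vertices); the function names are hypothetical, and the kernel and cokernel are not covered here.

```python
def domain(matching, n):
    """Top endpoints of the transversals (edges joining top and bottom rows)."""
    return {min(p) for p in matching if min(p) <= n < max(p)}


def codomain(matching, n):
    """Bottom endpoints of the transversals."""
    return {max(p) for p in matching if min(p) <= n < max(p)}


def rank(matching, n):
    """Number of transversals, that is, the size of the domain."""
    return len(domain(matching, n))


def top_pairs(matching, n):
    """Edges lying entirely in the top row; for a tree these are its cherries."""
    return [p for p in matching if max(p) <= n]


# Matching of the four-leaf caterpillar tree (((1,2),3),4).
caterpillar = [(1, 2), (3, 5), (4, 6)]
print(domain(caterpillar, 4), codomain(caterpillar, 4), rank(caterpillar, 4))
# {3, 4} {5, 6} 2
```

Every top node is either the end of a transversal or part of a top pair, so the number of top pairs is always (n - rank)/2; for the caterpillar above this gives one cherry.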

Structuring tree space via Green’s relations

As mentioned above, with the identification with phylogenetic trees in as Brauer diagrams of type via the bijection , the correspondence with the monoidal structure in , afforded by the bijection , is no longer direct. Rather, it is supplanted by the categorical setting of partial monoids [20-22], where a multiplication on exists only if . In practice, we analyse , and hence , via left- and right-multiplication by elements of and , respectively, and the associated equivalence classes. It is important to note here that the labels on the nodes that are matched in the product are implicitly reassigned, along the lines of the description in §3a, so that instead of bottom nodes reading from left to right in an element of , we treat them as though labelled . The multiplication on the bottom by an element of then matches along the bottom of one diagram with along the top of the other. As in the previous usage, we pull the appropriate multiplications back to trees via the bijection and Examples of the results of multiplication of five-leaf trees by Brauer generators are shown in table 1. In semigroup theory, elements that can be obtained from each other by left- and/or right-multiplication are classified according to Green’s relations (see [18,24]). In the context of phylogenetic trees, we find a restricted version of these relations to be most useful, in which the actions are by elements of the symmetric group on the left (top of the diagram) and on the right (bottom), as opposed to the full Brauer monoid or . We define these restricted adaptions of Green’s relations as follows:

Definition 4.1.

Green’s $\mathcal{L}$, $\mathcal{R}$, $\mathcal{H}$ and $\mathcal{D}$ relations are defined on $\mathcal{B}_{n,n-2}$ as follows. The $\mathcal{L}$-class containing a diagram $\alpha$ in $\mathcal{B}_{n,n-2}$ is the set of diagrams that can be reached from $\alpha$ by a left multiplication (by an element of $S_n$). Likewise, the $\mathcal{R}$-classes arise from right multiplication by elements of $S_{n-2}$. The $\mathcal{H}$-classes are sets of elements that are both $\mathcal{L}$- and $\mathcal{R}$-related, and these form the smallest of this family of equivalence classes. By contrast, the $\mathcal{D}$-classes are unions of intersecting $\mathcal{L}$- and $\mathcal{R}$-classes. These classes are displayed using arrays called eggbox diagrams (see ch. 2 of [18]). Each $\mathcal{D}$-class can be represented as a rectangular array whose entries are $\mathcal{H}$-classes. The $\mathcal{L}$-classes are then given by the rows of the $\mathcal{D}$-class, and the $\mathcal{R}$-classes by the columns. Note that the actions by $S_n$ and $S_{n-2}$ are not faithful, because for instance the action of a transposition $s_i$ on a cup between $i$ and $i+1$ has the effect of the identity action (table 1 includes an example). Because top actions cannot affect the bottom of a diagram, all diagrams in an $\mathcal{L}$-class have the same bottom half. And since the action of $S_n$ on the top connects all arrangements of the top half of the diagram that have the same rank ($S_n$ is the full symmetric group), each $\mathcal{L}$-class is the full set of diagrams with a particular bottom half. Likewise, each $\mathcal{R}$-class is indexed by the common top half of the diagrams within it. The $\mathcal{H}$-classes are then those diagrams that have the same top and bottom halves (the intersections of $\mathcal{L}$- and $\mathcal{R}$-classes), and there will be $r!$ of these, where $r$ is the rank of the diagrams in the class (this counts the number of ways to join into transversals the set of half-strings that come down from the top half, with those coming up from the bottom half). The $\mathcal{D}$-classes are unions of $\mathcal{L}$- and $\mathcal{R}$-classes, namely all diagrams of the same rank. Figure 6 shows a Green eggbox scheme displaying Brauer diagrams (and corresponding trees) for the $\mathcal{D}$-classes of the Brauer monoid $\mathcal{B}_{6,4}$ corresponding to six-leaf trees. 
The class with two cherries (rank 2) is shown in detail, to illustrate the coordinatization of trees provided by the Brauer structure. This rank two class [20-22] and eggbox represents a total of 540 trees, with six rows ($\mathcal{L}$-classes) and 45 columns ($\mathcal{R}$-classes), with 270 intersections ($\mathcal{H}$-classes, of cardinality $2! = 2$). The corresponding eggbox for rank 0 (three cherries), this eggbox (two cherries, rank 2) and the eggbox for rank 4 (caterpillar trees, with one cherry), together enumerate the totality of $45 + 540 + 360 = 945$ labelled phylogenetic trees on six leaves.
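The counts quoted for six-leaf trees can be checked by brute force. The sketch below enumerates every perfect matching on ten points, treats labels 1 to 6 as the top row and 7 to 10 as the bottom row (a labelling assumed for this example), and tallies diagrams by rank and by their top and bottom halves. The function names are hypothetical.

```python
from collections import Counter


def perfect_matchings(points):
    """Yield every perfect matching of a sorted list with an even number of points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for i, partner in enumerate(rest):
        for m in perfect_matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + m


def rank(matching, n):
    """Number of edges joining the top row (<= n) to the bottom row (> n)."""
    return sum(1 for a, b in matching if a <= n < b)


def halves(matching, n):
    """Top half (top-top edges) and bottom half (bottom-bottom edges)."""
    top = frozenset(p for p in matching if p[1] <= n)
    bottom = frozenset(p for p in matching if p[0] > n)
    return top, bottom


n, k = 6, 4
diagrams = list(perfect_matchings(list(range(1, n + k + 1))))
by_rank = Counter(rank(m, n) for m in diagrams)

rank2 = [halves(m, n) for m in diagrams if rank(m, n) == 2]
rows = {bottom for _, bottom in rank2}    # one row per bottom half
columns = {top for top, _ in rank2}       # one column per top half
cells = Counter(rank2)                    # diagrams sharing both halves

print(len(diagrams), dict(by_rank))
print(len(rows), len(columns), len(cells), set(cells.values()))
```

The first line of output reports 945 diagrams split as 45, 540 and 360 by rank; the second reports 6 rows, 45 columns and 270 cells of size 2 for the rank 2 class.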

Figure 6

An eggbox diagram illustrating the enumeration of elements of the $\mathcal{D}$-classes of the Brauer monoid $\mathcal{B}_{6,4}$, representing six-leaf trees. Upper diagram: schematic illustration of the separate eggboxes for the three $\mathcal{D}$-classes of Brauer ranks 0, 2 and 4, corresponding to trees with 3, 2 or 1 cherry, and comprising 45, 540 and 360 trees, respectively (see text for discussion). Lower diagram: a ‘close-up’ of the eggbox for trees with two cherries (and Brauer rank 2). Row labels show bottom halves of Brauer diagrams, representing equivalence classes by left action (on the top of a diagram), the $\mathcal{L}$-classes, whereas column labels show top-halves of diagrams, representing equivalence classes by right action, the $\mathcal{R}$-classes. The inset shows two labelled trees, corresponding to the two diagrams belonging to the selected $\mathcal{H}$-class (the intersection of the selected row and column).

Remark 4.2.

These equivalence classes have natural interpretations in terms of trees. The $\mathcal{L}$-classes are those trees that are the same up to permutations of leaf labels: their diagrams can be reached by multiplication by an element of $S_n$ along the top (which corresponds to the leaves). The $\mathcal{R}$-classes are those that have the same cherry structure at the leaves, but whose internal vertices have been permuted. The $\mathcal{H}$-classes represent trees with the same cherries but whose non-cherry leaves have been permuted. Finally, the $\mathcal{D}$-classes represent all trees with the same number of cherries (namely $(n-r)/2$, where $r$ is the rank).

Multiplicative structure: the set of phylogenetic trees as a semigroup

In using a bijection between trees and an algebraic object like a Brauer monoid, the pay-off is the algebraic structure that comes to the set of trees. In choosing to use the unbalanced diagrams of , we preserve information about the tree structure (with the leaves all along the top axis of the diagram), but as noted above, we lose the capacity to multiply diagrams in the way that is possible if the top and bottom axes have the same number of nodes.

The sandwich product

There is, nevertheless, still a multiplicative structure available to unbalanced diagrams such as , namely the sandwich product. This requires a fixed diagram (tree) , and then allows the product of two diagrams and to be composed by inverting (flipping it in the horizontal axis) to obtain a diagram , and sandwiching it between the two trees where is composition of diagrams. This product then allows us to define a semigroup relative to , which we denote . Figure 7 illustrates such a sandwich product in terms of Brauer diagrams in while the induced operation at the level of the corresponding trees is shown in figure 8 (as can be seen, in this case the examples show a sandwich-square, of the form ).

Figure 7

The sandwich product of two Brauer diagrams in , relative to a third (in the middle on the left). The corresponding trees are shown in figure 8.

Figure 8

The sandwich product of two trees relative to a third (with the bar over it). This product is computed using the diagram product of the corresponding Brauer diagrams, as shown in figure 7.

Interestingly, the sandwich semigroups relative to different trees are isomorphic if the trees have the same rank (the number of transversals in the diagram) [20, Theorem 6.4]. This means that the choice of tree as sandwich is only important (in terms of semigroup structure) up to its rank. Note that if a diagram has rank $r$, then its composition with any other diagram must have rank at most $r$, because composing diagrams cannot generate additional transversals. In particular, the sandwich semigroup relative to a tree $T$ does not contain an identity in general, because no sandwich product involving a tree $T'$ of rank greater than the rank of $T$ can ever return a tree of the same rank as $T'$. As a consequence, the sandwich semigroup is not a monoid.
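A sandwich product can be sketched in Python as two ordinary diagram compositions with the flipped sandwich diagram in the middle. The encoding is an assumption made for this example: nodes are pairs such as `("t", 3)` or `("b", 1)`, positions are counted left to right on both rows, and loops closed off in the middle are discarded. The two four-leaf diagrams `A` and `T0` are hypothetical examples, not the ones drawn in figures 7 and 8.

```python
def pairing(edges):
    """Map each node to its partner; edges are 2-element collections of nodes."""
    partner = {}
    for a, b in edges:
        partner[a], partner[b] = b, a
    return partner


def compose(upper, lower):
    """Stack `upper` on `lower`, gluing upper's bottom row to lower's top row."""
    up, low = pairing(upper), pairing(lower)
    outer = [v for v in up if v[0] == "t"] + [v for v in low if v[0] == "b"]
    result, seen = set(), set()
    for start in outer:
        if start in seen:
            continue
        node, in_upper = start, start[0] == "t"
        while True:
            node = (up if in_upper else low)[node]
            if node[0] == ("t" if in_upper else "b"):
                break  # back on an outer row
            node = ("t" if in_upper else "b", node[1])  # cross the middle row
            in_upper = not in_upper
        seen.update((start, node))
        result.add(frozenset((start, node)))
    return result


def flip(diagram):
    """Reflect a diagram about the horizontal axis."""
    swap = {"t": "b", "b": "t"}
    return {frozenset((swap[row], i) for row, i in edge) for edge in diagram}


def sandwich(a, b, middle):
    """Product of a and b relative to the fixed diagram `middle`."""
    return compose(compose(a, flip(middle)), b)


def rank(diagram):
    """Number of edges joining the top row to the bottom row."""
    return sum(1 for edge in diagram if {row for row, _ in edge} == {"t", "b"})


def as_set(edges):
    return {frozenset(edge) for edge in edges}


# Hypothetical diagrams with four top nodes and two bottom nodes.
A = [(("t", 1), ("t", 2)), (("t", 3), ("b", 1)), (("t", 4), ("b", 2))]   # rank 2
T0 = [(("t", 1), ("t", 2)), (("t", 3), ("t", 4)), (("b", 1), ("b", 2))]  # rank 0

print(sandwich(A, A, A) == as_set(A))    # True
print(rank(sandwich(A, A, T0)))          # 0
```

In this small case the rank 2 diagram squares to itself when it is also the sandwich, while sandwiching with the rank 0 diagram forces the product down to rank 0, in line with the rank bound stated above.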

The regular subsemigroup relative to a given tree

Let be an leaf tree. Using the sandwich product defined above, the set of trees with operation relative to is called the sandwich semigroup, denoted . The ‘regular’ elements of this semigroup form a subsemigroup . A semigroup element is said to be regular if there exists an such that . For the sandwich semigroup, this is equivalent to the property: is regular if and only if has the same rank as [20]. The regular elements in this Brauer sandwich semigroup can be characterized as follows (we extract the case relevant to this context).

Proposition 5.1. ([20] proposition 6.13)

Here, the join of two equivalence relations is their join in the lattice of equivalences, that is, the smallest equivalence relation that contains their union. The join separates if each equivalence class in contains at most one element of . The elements of can be enumerated according to their rank relative to that of , as follows:

Theorem 5.2. ([20] corollary 6.17)

The cardinality of the regular subsemigroup of for any diagram of rank , is given by

Example 5.3.

The tree whose diagram in is has rank (figure 9). The sum in theorem 5.2 is over , and can be computed as follows:

Figure 9

The Brauer diagram and the tree from the partition in example 5.3.

It is interesting to note the number of elements of each rank: 45 of rank 0, 504 of rank 2 and 216 of rank 4. The total number of diagrams of these ranks is, respectively, 45, 540 and 360. In other words, this regular subsemigroup contains all trees of rank 0, 504/540 (about 93%) of rank 2 and 216/360 (60%) of rank 4. We can dig a little further into this counting using the conditions in proposition 5.1. Given that and , the conditions for to be in this subsemigroup are that both: separates ; and separates . The first of these is trivially satisfied since the underlying sets of and of are disjoint (they partition the set ). The second condition is more easily approached by considering when it will not hold, namely when the set underlying has two or more elements in common with . Since and have disjoint underlying sets, this forces . These diagrams must have domain of size 2 or 4, and some simple counting gives the number with domain size 2 as 36, and the number of domain size 4 as 144, for a total of 180 diagrams not in the regular subsemigroup. This gives a total number of diagrams of $765 + 180 = 945$, which is indeed the number of rooted trees on six leaves (which is $(2n-3)!!$, and here $n = 6$). There are several interesting questions related to the regular subsemigroup with respect to a tree, that we will leave for further work. For instance, the regular subsemigroup relative to a tree $T$ constitutes a type of neighbourhood of $T$ (noting that $T$ can be easily checked using proposition 5.1 to be regular with respect to itself). There are many ways to define a neighbourhood of a tree, for instance using operations on trees like the nearest neighbour interchange (NNI) and subtree prune and regraft moves, or the transposition distance, also based on matchings [19,25]. It would be interesting to know the relationships among these neighbourhoods. Second, we can observe that in the case of example 5.3, all trees of rank 0 are in the regular subsemigroup. 
Is this a general property? Are there properties of a tree that make it regular with respect to a large proportion of other trees?
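The bookkeeping in example 5.3 can be checked with a few lines of Python. The sketch only re-adds the figures quoted in the text; it does not compute the regular subsemigroup itself, and the variable names are chosen for this example.

```python
from fractions import Fraction


def double_factorial(m):
    """Product m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


# Counts by rank as quoted in the example.
regular = {0: 45, 2: 504, 4: 216}
all_diagrams = {0: 45, 2: 540, 4: 360}

excluded = {r: all_diagrams[r] - regular[r] for r in regular}
share = {r: Fraction(regular[r], all_diagrams[r]) for r in regular}

print(sum(regular.values()), sum(excluded.values()))  # 765 180
print(sum(all_diagrams.values()), double_factorial(2 * 6 - 3))  # 945 945
print(share[2], share[4])  # 14/15 3/5
```

The per-rank differences, 36 at rank 2 and 144 at rank 4, match the counts of excluded diagrams by domain size given in the example.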

Non-binary trees and partition diagrams

In this section, we extend the results from binary trees to all trees , and to forests, taking advantage of a more general family of diagrams and an associated algebraic structure, called a partition monoid [20]. We begin with the generalization to trees where the binary constraint is lifted (so that internal vertices and the root may have out-degree greater than 2).

Non-binary trees

Recall that the underlying correspondence for binary trees on leaves is that a tree corresponds to a matching on the set : a partition of the set of non-root vertices into components of size 2. The generalization to non-binary trees maps a tree to a partition of the set of non-root vertices into subsets of size (note that the number of non-root vertices will be less than if the tree is not binary). For this reason, we will again exclude the trivial tree and require . The generalization begins by observing that algorithm 1 applies without change when the input is a tree that is not necessarily binary (as in [6]). We then obtain a partition from a tree by taking a subset to be a set of sibling vertices (having the same parent), and the correspondence for non-binary trees immediately follows the same algorithm as for binary trees (figure 10).

Figure 10

A partition corresponding to a non-binary tree. As with binary trees, non-leaf vertices in the tree are numbered in sequence using algorithm 1, choosing at each point the internal vertex whose children are all numbered and which has the lowest, numbered, child vertex. Let denote the set of partitions of a set of objects, and the set of those partitions for which all constituent subsets have at least two elements. We will also refer to , the set of those partitions for which all components have size exactly two. Note that is precisely the set of matchings on elements (in this case must be even). Write for the set of all partitions of a finite set, and and for the corresponding sets when the sizes of components are at least 2 or exactly 2, respectively. We now introduce partition diagrams, which are generalizations of the Brauer diagrams defined above from partitions that are matchings to more general partitions. Recall that Brauer diagrams in have nodes in two rows with nodes along the top numbered left to right 1 to , and nodes along the bottom, numbered right to left to . Nodes that are paired in the matching are connected by an edge. A partition diagram for a set partition of has nodes along the top numbered from left to right 1 to , and nodes along the bottom numbered right to left to . Nodes are connected by edges if their labels are in the same constituent subset of the partition. If there are nodes in the subset, we do not draw all edges, but instead draw the minimal number to show their common membership, which will be edges. Examples are shown in figures 11 and 12.

Figure 11

Obtaining a non-binary tree from a partition with components of size at least 2, via a partition diagram. Here, the partition is given in example 6.2.

Figure 12

(a) A diagram that does not correspond to a tree, because the corresponding partition has but the diagram does not have , as required by lemma 6.1. (b) The diagram of the same partition, but with the correct value of . (c) The corresponding tree.

Write for the set of partition diagrams with nodes along the top and along the bottom, and for the subset in which each partition is from . The set is the subset of in which and all blocks have size exactly 2. That is, . For (that is, not necessarily binary), write for the corresponding element of , and for the corresponding set partition of . Recall that is the number of subsets in the partition of , and is the cardinality of . The following lemma is the analogue of the property for diagrams from binary trees (that all satisfy ).

Lemma 6.1.

If , then the diagram satisfies .

Proof.

The blocks in the diagram correspond to sets of vertices in that have the same parent in , therefore, they are in one-to-one correspondence with the set of non-leaf vertices in . The non-leaf vertices in are represented by the numbered nodes along the bottom of the diagram, with the exception of the root of . Therefore, .

Note, this result means that given a partition of an integer with blocks of size at least 2, we can compute the values of and that give a tree corresponding to .
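The computation mentioned in the note can be sketched as follows. The symbols are assumptions introduced for this example, since the extracted text has lost the original notation: `n` is the number of leaves (top nodes), `m` the number of numbered non-leaf vertices (bottom nodes), and the proof above gives one block per non-leaf vertex, so the number of blocks is `m + 1`. The function name is hypothetical.

```python
def tree_shape_from_partition(partition):
    """Return (n, m) for a partition of {1..N} whose blocks all have size >= 2.

    n = number of leaves, m = number of labelled internal (non-root) vertices.
    Uses: number of blocks = m + 1 and N = n + m.
    """
    blocks = [set(block) for block in partition]
    if not blocks or any(len(block) < 2 for block in blocks):
        raise ValueError("need at least one block, and every block of size >= 2")
    total = sum(len(block) for block in blocks)
    if set().union(*blocks) != set(range(1, total + 1)):
        raise ValueError("blocks must partition {1, ..., N}")
    m = len(blocks) - 1
    n = total - m
    return n, m


# Two blocks of size 3 on six elements: one internal vertex, five leaves.
print(tree_shape_from_partition([[1, 2, 3], [4, 5, 6]]))  # (5, 1)
```

Any partition of 12 elements into four blocks of size at least 2 gives `(9, 3)` in the same way, which agrees with the counts quoted in example 6.2.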

Example 6.2.

Consider the partition of a set of 12 elements given by Since there are four blocks, lemma 6.1 implies . The partition diagram and corresponding tree on nine leaves with three internal non-root vertices are shown in figure 11. While it is clear that each non-binary tree may be expressed as a partition diagram, it is not the case that every partition diagram is obtained from a tree, because some will not satisfy lemma 6.1. The same observation holds, of course, for binary trees. For example, the diagram in figure 12a does not represent a tree. However, the partition it displays, , corresponds to a tree via a diagram that we can find using lemma 6.1 as follows (noting that and there are two blocks of size 3): and . The corresponding diagram and tree are shown in figure 12b,c. We are now able to generalize the correspondence between binary trees and matchings, to non-binary trees and sets of partitions. Recalling that is the set of partitions of a set of elements into subsets of size , let Note that by using the fact that , this may also be written

Theorem 6.3.

There is a 1-1 correspondence between the set of partitions of finite sets into components of size at least 2, and the set of (non-trivial) rooted phylogenetic trees.

Note, the ‘set of partitions of finite sets’ is not self-referential because the set of such partitions is infinite. This result is a direct corollary to theorem 6.5 below. In each direction, a partition diagram may be constructed using the partition and the values of and , so we also have as a consequence the following corollary. Let

Corollary 6.4.

The set is in bijection with the set of partition diagrams .

The correspondence in theorem 6.3 provides the potential for new ways to enumerate the set of rooted phylogenetic trees, by decomposing the set of partitions. For example, the set of all partitions of ordered sets into blocks of size at least 2 is naturally sliced up according to the size of the ordered set, . In terms of trees on leaves, this groups them according to their number of non-root vertices. In light of the above bijections, trees with particular characteristics such as this are able to be counted via the partial Bell polynomials [16] , whose monomial coefficients count the number of set partitions of , with blocks with specific frequencies. Note that , so that these are polynomials in indeterminates. Thus the above sequence of trees sliced by the size of the ordered set, is given by , whose first few terms, for sets of size 2 to 6, are 1, 1, 4, 11, 41 (sequence A000296 of the On-Line Encyclopedia of Integer Sequences [10]), so that for instance there are 11 partitions of a set of size 5 into blocks with no singletons: 10 ways to split into subsets of size 3 and 2 (trees on four leaves with one internal vertex), and one way to have a subset of size 5 (the star tree on five leaves). Similarly, the total number of trees with bifurcations or trifurcations only is the sequence , whose first few terms are (sequence A227937 of [10]). The former decomposition together with the correspondence in corollary 6.4 can be represented in the diagram in figure 13.
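The counts in this paragraph can be checked by brute force for small sets. The sketch below enumerates all set partitions, keeps those without singleton blocks, and groups them by number of blocks; it does not use the partial Bell polynomials themselves, and the function names are hypothetical choices for this example.

```python
from collections import Counter


def set_partitions(elements):
    """Yield every partition of the list `elements` as a list of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + smaller


def singleton_free_counts(size):
    """Count partitions of {1..size} with no singleton block, by number of blocks."""
    counts = Counter()
    for partition in set_partitions(list(range(1, size + 1))):
        if all(len(block) >= 2 for block in partition):
            counts[len(partition)] += 1
    return dict(counts)


print(singleton_free_counts(5))  # {1: 1, 2: 10}
```

For a set of size 5 this gives one partition with a single block (the star tree on five leaves) and ten with two blocks (trees on four leaves with one internal non-root vertex), for a total of 11.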

Figure 13

The correspondence between sets of phylogenetic trees and sets of partitions described in theorem 6.3, showing how the sets of partitions decompose the sets of trees.

An example of this decomposition for trees on five leaves is shown in figure 14.

Figure 14

The decomposition of into sets of partitions. Examples of corresponding partition diagram shapes are in the centre column, with example corresponding tree shapes in the right-hand column. The number of trees in each category is shown on the right, for instance, in the second row, there are diagrams, and of course the last row is , giving the Ward numbers [26] [10, Seq. A269939]. Note, these are just examples and there are other possible diagram and tree structures with, for instance, two internal vertices (partitions of into three blocks). This decomposes the set of all trees on five leaves according to the numbers of internal vertices, indicated by the number of nodes along the bottoms of the diagrams.

Forests

The correspondence given in §6a between trees and partitions applies to partitions with non-trivial subsets. Recalling that subsets in a partition correspond to sibling vertices in a tree, a natural interpretation for a singleton (trivial) subset is that it corresponds to a vertex with no siblings. Given that the definition of a phylogenetic tree used here (and elsewhere) excludes non-root vertices of degree 2, the natural interpretation for a singleton subset is that it corresponds to a root vertex. Therefore, a diagram with singleton vertices ought to correspond to a forest. Indeed, as in theorem 6.5, forests of phylogenetic trees on leaves provide a one-to-one correspondence with a set of partitions defined in equation (6.3) below. Let denote the set of -forests, that is the set of forests whose leaves are labelled by elements of the set , with . -forests are graphs whose connected components are rooted phylogenetic trees, whose leaves partition . Note that unlike the families of rooted trees, for forests we are allowing . Write for the number of non-trivial blocks of the partition . Following the definition in equation (6.1), define and Note, in , . Let denote the set of all forests on leaves, and the set of all forests.

Theorem 6.5.

(i) is in bijection with the set of partitions ;
(ii) is in bijection with the set of partitions ; and
(iii) the set of non-trivial forests on leaves with components is in bijection with the set of partition diagrams with singleton nodes that satisfy .

Proof.

(i) If is trivial, so that all its trees are isolated leaves, it will map to the trivial partition of , that is, . So we need to prove the correspondence between non-trivial forests and non-trivial partitions. Each (non-trivial) forest gives a partition, by first numbering non-leaf vertices according to algorithm 1, and then forming sets of sibling vertices, with labelled root vertices forming singletons. As with trees, this algorithm leaves a single root vertex unlabelled.

We now explore the properties of the partition arising from a forest, to help in constructing the map back from partitions to forests. Suppose is the partition obtained from the forest . If is non-trivial, we have a correspondence between non-leaf vertices in , and non-trivial blocks, given by the children of each non-leaf vertex. It follows that the number of non-leaf vertices in is precisely . The number of all vertices in a non-trivial forest is , because one vertex (one of the roots) is left unlabelled by algorithm 1. The vertices in are also either leaves or non-leaves, and so this number is also equal to . Therefore, and so .

Now consider a non-trivial partition . We will show how a forest can be constructed from . Set , and create a starting forest consisting of vertices as leaves, labelled . We will successively add vertices and edges to the forest as follows. First, set , and consider the non-trivial sets in . We have assumed that is non-trivial, so there will be at least one. We claim that at least one of these is contained in . Observe that there are integers outside and included in . But there are non-trivial subsets, and so at least one cannot contain an element outside , as required.

Let be the set of non-trivial subsets in contained in (that label vertices in ), and let be the element of containing the least integer. This is well-defined because, as argued, is non-empty. Add a vertex to , as a parent to the vertices labelled by the elements of , to create a new forest . Then remove from to define If has no non-trivial subsets, then end the algorithm and output . (In this case, we will have had , and so and all integers in are labelling vertices in .) Otherwise, label the vertex in by . has leaves labelled and one other vertex labelled , which is the parent of at least two of the leaves.

As before, we claim that contains a set that is a subset of , and the argument naturally extends as follows: has non-trivial subsets; there are integers in outside of ; and therefore, it is not possible for all non-trivial subsets in to include an element outside . Thus, the set of non-trivial subsets in contained in is non-empty. Choose the subset in with the least integer, and call it .

This process can continue, as described in algorithm 2, until we reach a point where has no non-trivial subsets and we output the resulting forest. The forest will have leaves, and additional vertices, of which will have labels: one root vertex will remain unlabelled. Thus all elements of will be labelling vertices, since .

Note that singletons in the partition are also labelling vertices in the forest. Any singleton in that is is already labelling an isolated leaf from the outset, and so represents a trivial tree. Any singleton greater than will be labelling a root vertex in a tree in the forest, because it will be assigned as a parent of vertices in a block, and will not be assigned a parent because it has no other elements in its block. See example 6.6 for an illustration of these observations. The process deterministically defines a forest on leaves whose vertices (except one root) are labelled by the elements of the partition , and completes the proof of (i).

(ii) immediately follows from the construction described above, which gives a correspondence between a forest with leaves and a partition satisfying the condition to be an element of .

For (iii), suppose has leaves and component trees, and is non-trivial. Labelling the vertices according to algorithm 1, all but one of these trees will have a labelled root, and so the corresponding partition will have singletons. As noted above, the number of non-leaf vertices is , because they correspond to sets of siblings, which correspond to blocks of the partition. This includes the single non-labelled root vertex, and so the number of non-leaf labels in the partition for is , and this is the value of in the partition diagram. But since the partition has singleton sets, we have , and it follows that as required. The reverse direction takes a diagram to a partition, from which a forest is then constructed according to algorithm 2. The conditions in the statement follow.
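Because the inline symbols of this proof did not survive extraction, the counting argument may be easier to follow written out in one place. The following is a reconstruction under assumed notation, not the authors' own display: $N$ is the size of the partitioned set, $b$ the number of non-trivial blocks, $s$ the number of singleton blocks, $n$ the number of leaves, $k$ the number of component trees, and $m$ the number of labelled non-leaf vertices (bottom nodes of the diagram).

```latex
\begin{align*}
  N + 1 &= n + b      && \text{(all vertices counted two ways; one root is unlabelled)} \\
  n     &= N - b + 1  \\
  m     &= N - n = b - 1 && \text{(labelled non-leaf vertices)} \\
  s     &= k - 1      && \text{(every root except one is labelled and forms a singleton)} \\
  b + s &= m + k      && \text{(total number of blocks)}
\end{align*}
```

With $k = 1$ and $s = 0$ the last line reduces to the single-tree case of lemma 6.1, where the number of blocks is one more than the number of bottom nodes.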

Example 6.6. (Construction of a forest from a partition via algorithm 2)

Take the partition given by Here, we have , , and . Set to be the forest with nine isolated leaves labelled , and set . We have , and so . Let be the forest obtained from by adding a vertex that is a parent of the vertices labelled 3, 4, 6. Set . Since has non-trivial subsets we label by 10, and , with . Let be the forest obtained from by adding a vertex that is a parent of the vertices labelled 1,10. Set . Since has non-trivial subsets we label by 11, and , with . Let be the forest obtained from by adding a vertex that is a parent of the vertices labelled 7,8. Set . Since has non-trivial subsets we label by 12, and , with . Let be the forest obtained from by adding a vertex that is a parent of the vertices labelled 2,12. Set . Since has non-trivial subsets we label by 13, and , with . Let be the forest obtained from by adding a vertex that is a parent of the vertices labelled 9,13. Set . Since has no non-trivial subsets we end the algorithm and output . The forests generated in this example are shown in figure 15.
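The construction of algorithm 2 can be sketched in Python and run on this example. The displayed partition was lost in extraction, so the input below is inferred from the steps listed in the example (blocks {3,4,6}, {1,10}, {7,8}, {2,12}, {9,13}, with 5 and 11 as singletons); treat it as an assumption. The function name, the `"root"` key for the one unlabelled vertex, and the returned data layout are also hypothetical.

```python
def partition_to_forest(partition):
    """Build a forest from a non-trivial partition of {1..N}.

    Returns (n, children): n is the number of leaves and `children` maps each
    added vertex (its integer label, or "root" for the single unlabelled
    vertex) to the sorted list of its children's labels.
    """
    blocks = [frozenset(block) for block in partition]
    total = sum(len(block) for block in blocks)
    if frozenset().union(*blocks) != frozenset(range(1, total + 1)):
        raise ValueError("blocks must partition {1, ..., N}")
    remaining = {block for block in blocks if len(block) >= 2}
    if not remaining:
        raise ValueError("partition is trivial: the forest is N isolated leaves")
    n = total - len(remaining) + 1      # from N + 1 = n + (non-trivial blocks)
    available = n                       # labels 1..available are already in the forest
    children = {}
    while remaining:
        ready = [block for block in remaining if max(block) <= available]
        if not ready:
            raise ValueError("no block is fully contained in the current labels")
        block = min(ready, key=min)     # the ready block containing the least integer
        remaining.remove(block)
        if remaining:                   # label the new parent with the next integer
            available += 1
            children[available] = sorted(block)
        else:                           # the final parent stays unlabelled
            children["root"] = sorted(block)
    return n, children


example = [[1, 10], [2, 12], [3, 4, 6], [5], [7, 8], [9, 13], [11]]
n, forest = partition_to_forest(example)
print(n)       # 9
print(forest)  # {10: [3, 4, 6], 11: [1, 10], 12: [7, 8], 13: [2, 12], 'root': [9, 13]}
```

The labels 10, 11, 12 and 13 are assigned in the same order as in the example. The singletons 5 and 11 never acquire a parent, so the output forest has three components: the isolated leaf 5, the tree rooted at 11, and the tree with the unlabelled root.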

Figure 15

The forests generated using algorithm 2, to obtain a forest from the partition , as described in example 6.6, outputting .

Note, theorem 6.3, relating to non-binary trees, is a special case of this theorem for those forests that consist of a single tree, and partitions without singletons. The correspondence with forests given in theorem 6.5 creates a broad set of correspondences between sets of partitions and sets of phylogenetic objects, as shown in figure 16. Recall that we have defined the following sets: Here might be empty (no restrictions), or (2) or , meaning components must be of size 2 or at least 2. But in the next two corollaries, we see correspondences for when components have size at most 2.

Figure 16

Correspondences between sets of trees or forests, all on leaves, sets of partitions and partition diagrams. In the left column, the components of the partition are all size exactly 2, and so .

Corollary 6.7.

The set of binary forests is in bijection with the set of partitions : those whose subsets have size at most 2.

Corollary 6.8.

The set of binary forests on leaves is in bijection with the set of partitions .

Semigroup structure

The semigroup structures that we have described for binary trees in §5, also extend to non-binary trees or forests, with some caveats. For instance, immediately we see that an action by the Temperley–Lieb generators will not be able to be defined for non-binary trees in general: the product of by a partition diagram that includes a block , where is on the opposite side to and , will result in a block of size one, namely . So the action of a Temperley–Lieb generator on a non-binary tree may result in a forest. However, action by the symmetric group generators does preserve the restriction on the partition diagram. In fact, action by preserves the number and size of the blocks in the partition. Therefore, we are still able to construct the eggbox diagram decomposition of the set of non-binary trees, and indeed for forests. This decomposition will have a richer structure, however, as the -classes are not determined simply by rank, but by other factors (the number and size of the blocks in the partition). This is a topic we leave for further investigation. The sandwich semigroup construction allows multiplication of non-binary trees, as it does in the binary case, although in general the product will not be closed. That is, if one of the trees involved in the sandwich product is not binary, it is possible that the product of diagrams results in an isolated node, which means the diagram may correspond to a forest of more than one tree (figure 17). Thus, the set of diagrams for non-binary trees is not closed under the sandwich product. Whether there are subfamilies of non-binary trees for which the product is defined is an interesting further question.
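The remark that the symmetric group generators preserve the number and size of the blocks can be illustrated at the level of set partitions. The sketch assumes that acting by an adjacent transposition simply exchanges the labels `i` and `i + 1` wherever they occur; the function names are hypothetical, and the diagram product itself (in particular the Temperley–Lieb action) is not implemented here.

```python
def swap_adjacent(partition, i):
    """Exchange the labels i and i + 1 in every block of the partition."""
    swap = {i: i + 1, i + 1: i}
    return sorted(sorted(swap.get(x, x) for x in block) for block in partition)


def block_sizes(partition):
    """Return the sorted list of block sizes."""
    return sorted(len(block) for block in partition)


before = [[1, 2, 3], [4, 5, 6]]
after = swap_adjacent(before, 3)
print(after)                                      # [[1, 2, 4], [3, 5, 6]]
print(block_sizes(before) == block_sizes(after))  # True
```

Since relabelling changes neither the number of blocks nor the size of the underlying set, the relabelled partition satisfies the same block-count condition as the original.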

Figure 17

Top: a sandwich product involving a diagram from a non-binary tree that results in a diagram that does not correspond to a tree, because it has an isolated node, but to a forest with two components. Bottom: the same sandwich product showing the trees and forest involved.

Likewise, the sandwich product allows the multiplication of two forests relative to a third. As with non-binary trees, there will be some products that are not defined, because the numbers of nodes in the diagrams do not match (if the numbers of non-leaf vertices in the forest are not equal). But additionally, even if this is satisfied so that the diagram product is defined, the product may not result in a forest. And this also applies to sandwich products of non-binary trees: the product of two non-binary trees, or forests, relative to a third, may result in a diagram that does not even represent a forest, in that it violates the condition in theorem 6.5(iii). An example is shown in figure 18.

Figure 18

A sandwich product involving diagrams from two non-binary trees relative to a third, and which results in a diagram that does not define a tree or a forest, because it violates the condition in theorem 6.5(iii). This product diagram has , (one more than the number of singleton nodes) and , so does not satisfy .

Discussion

The link between phylogenetic trees and algebraic structures such as Brauer monoids and partition monoids that we have described gives rise to a wealth of questions that need further exploration. To begin, some opportunities for further development were raised by Diaconis and Holmes in 1998 [5]. For instance, they suggest that the representation of a tree as a matching that they introduced (treating the matched pairs as 2-cycles) allows multiplication in the symmetric group to be used to create a random walk in tree space. They also suggested multiplying matchings through use of the Brauer algebra, an idea that we have developed further in this paper in a direction they perhaps did not anticipate (by preserving the leaves along the top of a diagram to create an unbalanced but biologically interpretable model). The developments here, representing the matching (or partition more generally) as a Brauer or partition diagram, allow a more targeted random walk to be defined that preserves certain key structures of the trees. That is, a random step can be performed by acting on the top or bottom of a diagram by an element of or , and allow movement along an -class or -class, which preserves certain structural properties of the tree, as described in §4.

There are many further questions that warrant exploration, some of which we list here.

There are several opportunities to extend the ideas in this paper in different directions. For instance, is it possible to represent phylogenetic networks within a diagram semigroup framework? And coming from the algebraic point of view, is there a role for other monoidal and categorical systems, such as those described in [20] (compare also Loday [29]) to play within phylogenetics, or other scientific and combinatorial problems?

Questions about products of trees in the sandwich semigroup

We have defined a way to multiply two binary trees relative to a third, via the sandwich product. How do features of trees (such as their balance) behave when they are multiplied together? Are some properties of trees ‘closed’ under multiplication? If two trees share the same property, when does their product share the same property?

A feature of the sandwich semigroup construction is that it creates a product relative to a fixed tree. Is there some way to exploit this feature in phylogenetics? For instance, is it possible that gene trees may be constrained to be inside some neighbourhood defined by the product relative to the species tree (see [27] for discussion of the challenges in understanding this relationship)?

The sandwich semigroup product works naturally for binary trees, whose Brauer diagrams have predictable numbers of nodes. Given the comments in §6b and figure 18, are there subclasses of forests (other than binary trees) for which the product is defined and closed?

Questions about the regular subsemigroup

To what extent are features of trees preserved within the regular subsemigroup corresponding to the tree? Using balance as an example again, are the trees in the subsemigroup all close to being balanced under one of the standard balance measures?

How do topological operations on a tree, such as the NNI, interact with the structures described here? For instance, is the NNI neighbourhood of a tree contained in the regular subsemigroup of the sandwich semigroup at ? Or as mentioned at the end of §5, are trees of rank 0 (with maximal cherries) always in the regular subsemigroup?

Other links to tree space

As noted in the Introduction, Diaconis and Holmes raised the prospect of using the product of trees to randomly move around tree space or search for an optimal solution to a problem. The properties of this random walk, or one extended to non-binary trees or forests, are unknown.

If we restrict attention only to Brauer diagrams that are planar, that is, for which there are no crossings of lines, then we obtain another closed algebraic structure. That is, in the sandwich semigroup corresponding to a tree with planar diagram, the product of two planar diagrams will remain planar. Any diagram can be made planar by relabelling the leaves, so would this semigroup describe some sort of canonical representatives of tree space?

Can operations on trees, such as edge-cutting that take a tree and produce a forest (but also even tree rearrangement operations such as NNI), be implemented by a sandwich product, or an action by an element of the partition monoid? For instance, can sandwich products such as that seen in figure 17 be controlled systematically?

There is an interesting correspondence between ‘augmented perfect matchings of containing wiggly lines’ and phylogenetic trees on leaves and internal and root vertices, recently described in [28]. It would be interesting to investigate how such augmented matchings link to diagrams, and the set of partitions described here for such trees in theorem 6.3.

Other links to semigroup theory and the Brauer algebra

Within semigroup theory, and indeed broadly within other algebraic areas, the idempotents (elements $e$ for which $e^2 = e$) play a very important role. What are the idempotents within the sandwich semigroup, and what relationships do the corresponding trees share? A formula for the number of idempotents in the sandwich semigroup is known [20, Theorem 6.18]: for instance, for a six-leaf tree with three cherries (rank 0), there are 45 idempotents in the corresponding sandwich semigroup (which is also the number of rank 0 trees on six leaves). A characterization could be illuminating. Narrowing it down further, the mid-identities (elements that are idempotents when regular but satisfying additional conditions) could have a concrete phylogenetic relationship to the original tree.

It would be interesting to study further the restricted Green’s relations in which the action on left and right is from a subgroup or subsemigroup (like our action in §5).

The general relations in the Brauer algebra (see appendix A) contain a central generator that we ignore by setting it to be 1. In the algebra, it tracks loops that can occasionally be generated by diagram concatenation, as occurs in the example in figure 2. Is there phylogenetically relevant information that can be captured by loops arising in products, and that might benefit from use of the full ?

Clearly, there is opportunity for exploration and development of this approach at an algebraic, combinatorial, phylogenetic and computational level. Diagram semigroups, monoids, algebras and categories have found numerous diverse and powerful applications within mathematics and physics, and it is exciting to think that they may open new doors to phylogeneticists.

2 in total

1. Matchings and phylogenetic trees.

Authors: P W Diaconis; S P Holmes
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

Review 2. The inference of gene trees with species trees.

Authors: Gergely J Szöllősi; Eric Tannier; Vincent Daubin; Bastien Boussau
Journal: Syst Biol Date: 2014-07-28 Impact factor: 15.683