
State-space reduction and equivalence class sampling for a molecular self-assembly model.

Daniel M. Packwood, Patrick Han, Taro Hitosugi.

Abstract

Direct simulation of a model with a large state space will generate enormous volumes of data, much of which is not relevant to the questions under study. In this paper, we consider a molecular self-assembly model as a typical example of a large state-space model, and present a method for selectively retrieving 'target information' from this model. This method partitions the state space into equivalence classes, as identified by an appropriate equivalence relation. The set of equivalence classes H, which serves as a reduced state space, contains none of the superfluous information of the original model. After construction and characterization of a Markov chain with state space H, the target information is efficiently retrieved via Markov chain Monte Carlo sampling. This approach represents a new breed of simulation techniques which are highly optimized for studying molecular self-assembly and, moreover, serves as a valuable guideline for analysis of other large state-space models.


Keywords:  Markov chain Monte Carlo; model reduction; self-assembly

Year:  2016        PMID: 27493765      PMCID: PMC4968457          DOI: 10.1098/rsos.150681

Source DB:  PubMed          Journal:  R Soc Open Sci        ISSN: 2054-5703            Impact factor:   2.963


Introduction

The oft-cited quote ‘We are drowning in information but starved for knowledge’ implies that information has become so abundant that it is nearly impossible to draw useful conclusions from it. The availability of affordable, high-performance computational resources allows for the simulation of sophisticated models with enormous state spaces. A researcher can, therefore, generate huge volumes of data with relatively little effort. On the other hand, the ‘target information’ desired by a researcher may be so deeply embedded in this dataset that it cannot be isolated without the use of elaborate statistical analysis. We will refer to these situations as large state-space problems. Techniques that efficiently recover target information from such large datasets are subject to intense research, and will remain of high priority so long as our hunger for facts and figures remains unsatisfied.
As a case in point, we consider the phenomenon of molecular self-assembly, in which molecules deposited on a surface spontaneously assemble into highly ordered structures [1-3]. Figure 1a illustrates the fabrication of graphene by self-assembly of molecules on a copper surface [4]. Upon heating, the molecules assemble into chain-shaped ‘islands’ (figure 1b,c), and on further heating a chemical reaction occurs between molecules, resulting in graphene (figure 1d). Island formation is a necessary ‘stepping stone’ for graphene formation, and one goal of this experiment is to identify conditions (temperature, type of molecule and type of surface) that support island formation. However, this experiment also yields considerable additional information, including the positions of the islands on the surface, the distances between the islands and so on. This information does not help us to identify island-forming conditions, although it may be interesting in other contexts.
We are faced with the following problem: build a model for the molecular self-assembly process and then estimate the probability of forming particular islands (the ‘target information’) without estimating any other properties of the model. Such a model should possess a very large state space in order to incorporate every conceivable outcome of the experiment in figure 1. Molecular self-assembly, therefore, provides an example of a large state-space problem and is of sufficient scientific importance to warrant attention.
Figure 1.

Formation of graphene by molecular self-assembly. (a) Self-assembly process represented by chemical formulae. (b–d) Scanning tunnelling microscopy images showing each step of the molecular self-assembly process.

Numerous models and methods have been employed in the chemistry and physics literature to study the molecular self-assembly process. Some recent reports have described molecular dynamics simulations [5-10], density functional theory [11], kinetic Monte Carlo simulations [12], stochastic models [13], graph theory [14] and mean-field models [15], to name a few. However, the possibility of studying self-assembly by selective retrieval of target information as described above has not been considered in this literature. In fact, the need for such a study seems to be particularly acute in the case of molecular self-assembly. For example, while molecular dynamics simulations should eventually yield reliable statistics on island formation, the simulation times required for the entire state space to be sufficiently sampled are inaccessible with modern computational power. The enormous state space of typical molecular self-assembly models, therefore, provides a barrier to both data collection and data analysis. Specialized mathematical techniques that can astutely access the target information embedded in a molecular self-assembly model would, therefore, provide a new generation of tools for studying molecular self-assembly. Moreover, such techniques would provide useful guidelines for developing similar methods for other types of large state-space problems. In this paper, we present the equivalence class sampling (ECS) technique for acquiring target information from a simple molecular self-assembly model. In the ECS technique, the state space of the model is partitioned into equivalence classes via an appropriate equivalence relation.
Then, Markov chain Monte Carlo (MCMC) sampling [16] is performed on the reduced state space H, where H is the collection of equivalence classes. In this paper, two states belong to the same equivalence class if and only if they differ only in the positions or orientations of their islands on the surface (precise details are provided in the main text). Recall that the target information for molecular self-assembly relates to the ‘kinds of islands’ that form under a given set of conditions. In ECS, all model configurations that contain the ‘same kinds of islands’ are placed into the same equivalence class. The reduced state space H therefore lacks all information unwanted by the researcher, such as the distances between islands or the positions of islands on the surface. From a physical point of view, it can be shown that ECS directly samples the model properties with respect to a free energy, which is most desirable for any method that calculates equilibrium (stationary) physical properties. The price of these benefits is that ECS (as developed in this paper) requires a ‘low-coverage approximation’, in which the density of molecules on the surface is very small. However, such conditions are typical in experiments which study island formation via molecular self-assembly. Molecular self-assembly processes occurring under high-coverage conditions tend to produce ‘two-dimensional crystals’ rather than molecular islands [8,9], and are, therefore, not concerned with island formation. The ECS concept has been explored for the characterization of graphical models of statistical dependence (Bayesian networks) [17-23]. However, the results obtained in these studies do not seem to apply directly to the molecular self-assembly problem, and new developments of the ECS technique are presented here.
In particular, we develop and characterize the extension–reduction chain, a Markov chain with state space H, which in turn is used to generate proposals in the MCMC part (specifically, for a Metropolis–Hastings (MH) algorithm) of the ECS technique. This paper is organized as follows. In §2, we introduce the self-assembly model used in this study, and in §3, we introduce the ECS approach for this problem. In §4, we present some simulation results, and in §5, we draw conclusions.

Chiral block assembly model

The chiral block assembly (CBA) model is a simple model for the molecular self-assembly process described in figure 1. Consider a finite square lattice L of dimension d × d, where d is an integer. N blocks reside on the lattice (figure 2a), where N is a fixed and given integer. A block is a square subset of the lattice that contains 3 × 3 points. The centre-most point of the block is called the coordinate of the block. The remaining eight points belong to the perimeter of the block. The perimeter of each block is decomposed as follows. Let t be an arbitrary block and let F denote the perimeter of t. Let F1 (respectively, F2, F3, F4) be the three points in t that lie above (respectively, to the left of, below, to the right of) the coordinate of t. The sets F1, F2, F3 and F4 are called faces, and F is the union of its faces. The block t is said to have one of two orientations, o1 and o2, and two face types, m1 and m2. If t has orientation o1, then faces F1 and F3 are of type m1, and faces F2 and F4 are of type m2. If t has orientation o2, then faces F1 and F3 are of type m2 and faces F2 and F4 are of type m1. Two blocks are said to touch at their faces if the intersection of their faces contains more than one point. In this case, the two blocks are said to be neighbours. Note that this definition excludes cases where two blocks share only a ‘corner’. Two blocks are said to overlap if the coordinate of either block is contained in their intersection. The blocks may be arranged on the lattice in any way and can touch at their faces, but no two blocks may overlap.
Figure 2.

(a) Diagram of the chiral block assembly model (see text). (b) Pair of rotationally isomorphic states. (c) Canonical representation for the equivalence class containing the states in (b).

To define an interaction energy between neighbouring blocks (pairs of blocks that touch at their faces), we first classify the ways in which neighbours can be aligned with respect to each other. These alignments are referred to as P, S and T, and the rules for assigning the alignments are described in figure 2a. Note that the rules for assigning alignments are independent of the orientations of the blocks. For each pair of neighbouring blocks i and j, we assign an interaction energy εr,m,m′, where r is the alignment of blocks i and j (S, P or T), and m (respectively, m′) is the type of the face of block i (block j) which is touching the face of block j (block i). The symmetry condition εr,m,m′ = εr,m′,m is enforced. The collection of these interaction energies is called the force field. The adjective ‘chiral’ refers to the fact that the alignment between any two neighbours does not possess mirror symmetry. This feature occurs in many real cases of molecular self-assembly on metal surfaces [4,24]. The CBA model is a special case of Winfree's tile assembly model [25-27]. Without further mention, we will assume periodic boundary conditions on the CBA model. This is achieved by replacing the lattice L = X × Y with the lattice L′ = X′ × Y′, where X′ = Y′ = {1, 2, … , d, d + 1}, and then imposing that the lattice points 1 and d + 1 in X′ are identical and that the lattice points 1 and d + 1 in Y′ are identical. This results in the torus T. The nomenclature ‘lattice’ and ‘dimension’ will be taken to mean the set X × Y and the integer d, respectively. A configuration is any choice of block positions and orientations that satisfies the conditions of the CBA model (no overlap of blocks). Let Ω be the collection of all configurations.
For each σ ∈ Ω, we assign the probability μ(σ) = Z⁻¹ exp(−ε(σ)/kBT0) (2.1), where kB and T0 are positive constants (the Boltzmann constant and temperature, respectively) and ε(σ) is the sum of the interaction energies over all pairs of neighbouring blocks in σ. The normalizing constant Z in (2.1) is assumed unknown. A state is any choice of coordinates for the blocks that satisfies the conditions of the CBA model. We can write Ω = T × O (2.2), where T is the state space (the collection of all states) and O is the collection of all choices of orientations for the blocks. For each b ∈ T, we can assign the probability μ(b) = Σω μ(σb,ω) (2.3), the sum running over all orientation choices ω ∈ O, where σb,ω denotes the configuration in which the blocks are arranged according to b and carry orientations ω. Accordingly, we can write μ(σb,ω) = μ(b)ν(ω | b), where ν(ω | b) is the probability of the blocks having orientations ω given that the blocks are arranged according to b.

Equivalence class sampling

We wish to estimate the stationary-state properties of the CBA model by sampling states from T and orientations from O according to the target distributions μ (equation (2.3)) and ν, respectively. We will focus exclusively on the problem of sampling from T, because the problem of sampling orientations (which is equivalent to sampling the Ising model) has been studied thoroughly in other contexts [28].

State-space reduction and Markov chain Monte Carlo

The ECS technique involves two stages. In the first stage, the state space T is reduced in such a way that the ‘target information’ is preserved and the remaining superfluous information is eliminated. For the self-assembly problem, the target information pertains to the kinds of islands that are likely to appear according to (2.3). We, therefore, begin by clarifying the wording ‘kinds of islands’ in terms of the CBA model. Let b be an arbitrary state of the CBA model and denote the blocks in b by t1, t2, … , tN. An island in b is a subset of blocks I from b such that

I.1. I cannot be split into two disjoint subsets I1 and I2 such that no block in I1 is a neighbour of a block in I2, and

I.2. there does not exist a block t in b but not in I such that I′ = I ∪ {t} satisfies I.1.

Condition I.1 essentially says that an island cannot be split into two disconnected parts, and I.2 asserts that a proper subset of an island is not itself an island. Consider the two states b1 and b2 shown in figure 2b. Imprecisely, we can say that b1 and b2 contain the ‘same kinds’ of islands because if we rotate the islands in b1 and translate them about the lattice, we can obtain b2. More precisely, two states b1 and b2 are said to be rotational isomorphs if

RI.1. each island in b1 can be mapped onto an island in b2 by rotations and translations alone, and

RI.2. this mapping is one-to-one and onto between the islands of b1 and the islands of b2.

Similarly, we say that two islands I1 and I2 are rotational isomorphs if I1 can be mapped onto I2 by rotations and translations alone. We can, therefore, partition T into equivalence classes E1, E2, … , E|H|, where any two states b1 and b2 belong to the same equivalence class if and only if b1 and b2 are rotational isomorphs. We refer to the collection H = {E1, E2, … , E|H|} as the reduced state space. We also have the following.

Theorem 3.1. Let b1 and b2 be two states from the CBA model. If b1 and b2 belong to the same equivalence class, then μ(b1) = μ(b2), where μ is the probability measure in equation (2.3).
Proof of theorem 3.1 is provided in appendix A.3. According to theorem 3.1, we can assign the probability υ(E) ∝ n(E)μ(E) (3.2) to equivalence class E ∈ H, where n(E) is the number of states that belong to equivalence class E (the degeneracy of E), and μ(E) is the probability from equation (2.3) calculated for any member of E. The problem of ‘what kinds of islands appear at equilibrium’ can then be restated as ‘which equivalence classes occur with relatively high probability υ’. In order to obtain this target information, we only need to consider the reduced state space H, rather than the (considerably larger) space T. Note that, in contrast with the state space T, the reduced state space H does not grow as the dimension d of the lattice increases (provided that the number of blocks N is fixed). Thus, for very large d, the reduced state space H will be considerably smaller than the full state space T. In passing, we note an important physical aspect of the ECS approach. It can be shown that υ(E) ∝ exp(−A(E)/kBT0) (3.3), where A(E) can be regarded as the ‘free energy of equivalence class E’. This is discussed in detail in appendix A.1. ECS, therefore, directly incorporates thermodynamics into the MCMC sampling framework, which is a highly desirable feature when the equilibrium (stationary) properties of a molecular self-assembly model are of interest. In order to implement ECS in a computational environment, it is necessary to use a canonical representation for each equivalence class in H (figure 2c). In this study, the canonical representation C of equivalence class E is generated by choosing an arbitrary state b from E and letting the elements of C be the islands of b together with one empty element. The empty element is necessary for our implementation of MCMC, as described in the following section. We do not distinguish between canonical representations that can be made equivalent by rotational isomorphism of their elements or by re-ordering of their elements.
Therefore, a canonical representation C is a unique representation of the equivalence class E. The number of islands in equivalence class E is |C| − 1.
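The free-energy reading mentioned above can be sketched explicitly. Writing the per-class probability μ(E) in Boltzmann form with an effective class energy ε(E) = −kBT0 ln μ(E) (our notation; appendix A.1 of the paper gives the precise treatment), the class probability rearranges as

```latex
\upsilon(E) \;\propto\; n(E)\,\mu(E)
            \;=\; n(E)\, e^{-\varepsilon(E)/k_B T_0}
            \;=\; e^{-\left[\varepsilon(E) \,-\, k_B T_0 \ln n(E)\right]/k_B T_0}
            \;=\; e^{-A(E)/k_B T_0},
\qquad A(E) \;=\; \varepsilon(E) - k_B T_0 \ln n(E).
```

Here kB ln n(E) plays the role of a Boltzmann entropy, so A(E) is an energy-minus-temperature-times-entropy combination, i.e. a free energy: MCMC on H weights classes by free energy rather than by bare energy.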

Extension–reduction chain

To sample equivalence classes via MCMC, we will employ the MH algorithm. We use Z to denote the Markov chain generated by the MH algorithm (the ‘MH chain’), and Y to denote the Markov chain used to generate proposals for the MH algorithm (the ‘proposal chain’). In this section, we introduce a suitable proposal chain called the extension–reduction chain. To define the extension–reduction chain, we first construct a transformation K that maps equivalence classes to other equivalence classes at random. Fix an equivalence class E with canonical representation C and consider the following actions.

T.1. Choose an island k from C with probability given by equation (3.4), which depends on nk (the number of blocks in island k) and a positive, non-zero constant α1. Now, choose an island j from C with probability given by equation (3.5), which depends on a positive, non-zero constant α2.

T.2. For the island k chosen in step T.1, consider the set of all islands that can be obtained by deleting a single block from island k. Each of these islands must satisfy the conditions I.1 and I.2 outlined earlier. Now, if there are pairs of islands in this set that are rotationally isomorphic, delete one member of the pair, and repeat until no two islands in the set are rotationally isomorphic. The resulting set, denoted Rk, is called the reduction set of island k.

T.3. For the island j chosen in step T.1, consider the set of all islands that can be obtained by adding a single block to island j. Then, delete islands from this set until no two islands in it are rotationally isomorphic. The resulting set, denoted Xj, is called the extension set of island j.

T.4. Replace island k in C with a uniformly chosen element of Rk, and replace island j in C with a uniformly chosen element of Xj. This produces a set C′. Add one empty island to C′ if there are no empty islands in C′; if there are two empty islands in C′, delete one of them. Set C = C′.

The resulting set C′ is a canonical representation for an equivalence class E′, which may or may not be identical to E. This transformation is denoted with the symbol K, and is referred to as the extension–reduction transformation. An example of an outcome of the extension–reduction transformation is shown in figure 3. Note that, by virtue of equation (3.5), the empty island may be replaced with a single block with non-zero probability in any occurrence of the extension–reduction transformation. In such a case, we can intuitively think of the transformation as creating a new island by removing a block from an existing island.

Figure 3.

Example of an extension–reduction transformation on equivalence class E. The island in the red box is extended and the island in the dotted red box is reduced.

The extension–reduction transformation K maps equivalence classes to equivalence classes at random; the probability P(K(E) = E′) is evaluated in appendix A.5. With this characterization of the extension–reduction transformation, we can generate a Markov chain with state space H as follows.

Theorem 3.2. Choose an arbitrary equivalence class E and set Y1 = E, Y2 = K(Y1), Y3 = K(Y2), and so on. Then the sequence Y1, Y2, Y3, … is a υ-irreducible and aperiodic Markov chain on H with transition probability given by equation (3.6).

Theorem 3.2 is proved in appendix A.3.
The Markov chain Y in theorem 3.2 is the extension–reduction chain. The parameters α1 and α2 in equations (3.4) and (3.5) can be used to control the evolution of Y. Y does not have stationary distribution υ, and to perform ECS we therefore use the transition probability in equation (3.6) to simulate the standard MH chain Z.

Approximation for the degeneracy factor

It does not appear possible to provide a closed-form expression for the degeneracy factor n(E) in equation (3.2) in the general case. However, we can form a good approximation to it with the following two-step method. (a) Compute the degeneracy factor exactly for a special case in which each island covers only a single lattice point. (b) Show that the degeneracy factor for the actual CBA model converges to the degeneracy factor obtained from this special case in a ‘low-coverage limit’, in which the lattice is extremely large compared with the number of blocks. The concept of a semicanonical representation is needed to carry out step (a). Choose an equivalence class E and let b1 and b2 be any two states that belong to E. Let U and V be the sets of islands belonging to b1 and b2, respectively. Suppose that U and V can be made equivalent by translation of their islands without rotation. U and V are then called semicanonical representations of equivalence class E. We do not distinguish between semicanonical representations that are equivalent under translation of islands, and therefore U and V are considered identical semicanonical representations. A canonical representation is also a semicanonical representation; however, two sets of islands corresponding to the same canonical representation may correspond to different semicanonical representations. Now, fix a semicanonical representation V. We can partition the set of islands in V into subsets V1, V2, … , Vr, where two islands belong to the same subset if and only if they can be made equivalent by translation about the lattice. We will refer to the quantity n(V) = d²!/(a1! a2! ⋯ ar! (d² − m)!) (3.7) as the degeneracy of the semicanonical representation V, where m is the number of islands contained in V, and a1, … , ar are the numbers of islands belonging to the subsets V1, V2, … , Vr described above. We now consider the meaning of n(V) in equation (3.7).
n(V) is the number of ways to divide the d² points of the torus T into r + 1 subsets V0, V1, V2, … , Vr, where permutation of points belonging to the same subset does not affect n(V). The sets V1, V2, … , Vr are exactly those defined in the previous paragraph, whereas the set V0 can be thought of as the set of d² − m ‘unoccupied’ points of the lattice. Equation (3.7) therefore counts the number of ways to assign islands from a semicanonical representation to the lattice, in such a way that each island covers only a single point of the lattice while retaining information on its shape. The number n0(E) = Σ n(V) (3.8), where the sum runs over Γ(E), the set of all distinct semicanonical representations available to equivalence class E, can be interpreted as the degeneracy factor of the case where the islands cover only a single point of the lattice. Considering the actual degeneracy factor n(E) in equation (3.2), we in general have n(E) < n0(E). However, in appendix A.4, we show that n(E)/n0(E) → 1 (3.9) for any equivalence class E as d → ∞ with N (the number of blocks) fixed. Following step (b) mentioned above, we therefore employ the low-coverage approximation υ0(E) ∝ n0(E)μ(E) (3.10) in order to perform ECS in the following sections. The low-coverage approximation is mathematically permissible for the following reason. Let Z and Z0 be the MH Markov chains obtained when sampling from the actual distribution υ in equation (3.2) and when using the low-coverage approximation, respectively. Without loss of generality, we can take Z and Z0 to be υ- and υ0-distributed random variables, respectively. Equation (3.9) then implies that maxE∈H |υ0(E) − υ(E)| → 0, and hence that Z converges to Z0 in distribution as d → ∞. From a physical point of view, the low-coverage approximation in equations (3.8) and (3.10) has connections with the statistical mechanical concept of indistinguishable and distinguishable particles. This suggests that, under the low-coverage approximation, the CBA model might be regarded as a kind of ‘two-dimensional lattice gas’ model.
This very natural physical picture further supports the application of the low-coverage approximation in ECS. Note that the study of MCMC Markov chains with approximate transition kernels appears to be related to research on the ‘pseudo-marginal’ MCMC algorithm [29-32].

Computational implementation

We have written a code called IslandPrediction which performs ECS on the CBA model as described above. To enhance convergence rates, the IslandPrediction code incorporates parallel tempering into the MH algorithm, exactly as described in reference [33]. The replica MH chains differ in the temperature of their stationary distribution, and each MH chain has proposals generated according to its own extension–reduction chain, where these extension–reduction chains are simulated independently of one another. IslandPrediction is available as an R script [34,35] and as a program written in Go [36] (both codes available upon request). The Go version of IslandPrediction is intended for multicore environments. A striking feature of IslandPrediction is that the MH chain converges relatively quickly with the length of the extension–reduction chain. Using the conditions reported in table 1, we observe burn-in periods of around 20 000–30 000 steps of the MH algorithm and an asymptotic acceptance rate of between 0.1 and 0.5 (figure 4). Reasonable estimates of quantities such as the island size distribution can be obtained after 100 000 steps of the MH algorithm, provided that the burn-in period has been discarded. The simulations presented in the following section were actually run for 10⁶ steps, with the first 5 × 10⁵ steps removed before analysis. Results acquired with the shorter simulation criterion of 100 000 steps described above showed no significant difference from the longer simulations presented in the following section. Computation of the transition probabilities in equations (A 25)–(A 28) is the major performance bottleneck in these calculations.
These calculations can take between 0.5 and about 5 s, depending on the number and size of the islands in the canonical representations, and running the entire simulation for 100 000 steps of the MH algorithm typically takes around 16 h for the case of 10 blocks when the R code is used on a typical desktop computer. By contrast, simulations of 100 000 and 10⁶ steps of the MH algorithm take only around 30 min and 3 h, respectively, when the Go version of IslandPrediction is run on a 16-core cluster.
Table 1.

Parameters used in the simulations described in §4. T01, T02 and T0step, respectively, refer to the lowest temperature, highest temperature and temperature step used for parallel tempering.

εP,1,1 = −0.1 eV      εP,1,2 = 0.03 eV
εS,1,1 = −0.08 eV     εS,1,2 = 0.02 eV
εT,1,1 = −0.05 eV     εT,1,2 = −0.01 eV
εP,2,2 = 0.05 eV      d = 50
εS,2,2 = 0.025 eV     N = 10
εT,2,2 = 0.10 eV      α1, α2 = 1, 1
T01 = 100 K           T02 = 200 K
T0step = 10 K
Figure 4.

Plot of the quantity μ in equation (3.2) as a function of Zn, the state of the MH chain at step n, for a typical simulation of ECS at 100 K (see text for details). The red line corresponds to an, the acceptance rate up to step n. Plot (a) shows the first 10⁵ steps of the simulation, and plot (b) shows the entire 10⁶ steps of the simulation.


Simulation results

We now present results obtained using the ECS procedure with the conditions shown in table 1. The block orientations of any interesting islands were optimized individually using a standard technique for studying Ising models (see appendix A.1). Figure 5a shows randomly chosen equivalence classes that appeared beyond the burn-in regime in our calculations at 100 and 150 K. At low temperatures, we find that equivalence classes almost always contain a single island, whereas at higher temperatures, these equivalence classes instead consist of multiple, smaller islands. This observation is quantified in figure 5b, which shows the island size distributions computed from the above simulations. The island size distribution f(k) is the probability of an island containing k blocks appearing in an equivalence class chosen randomly with respect to equation (3.2). At high temperatures, f(k) decays rapidly with the number of blocks k in the island. A similar decay was reported for the graphene size distribution obtained in the experiment described in figure 1 [4], which suggests that in these experiments, graphene was produced under conditions in which many small islands were present.
Figure 5.

(a) An equivalence class randomly chosen beyond the burn-in period for the simulation performed at 100 K (top row), and for the simulation performed at 150 K (bottom row). The inset explains how the graphics should be interpreted. (b) Island size distribution estimated between 100 and 200 K (see text). (c) Alignment distribution estimated between 100 and 200 K (see text for details). The sample of equivalence classes used to generate this figure is available as the electronic supplementary material.

The behaviour of the model is most interesting at low temperatures. As temperature decreases, the distribution spreads towards larger island sizes. At about 120 K, a second sharp mode starts to appear at k = 10 blocks. The occurrence of this mode shows that, at low temperatures, once all of the blocks gather into a single island, they tend to remain in this state. In order to move away from this state, one of the blocks in this island must break away, resulting in a state containing two islands: one containing nine blocks and one containing a single block. From a physical point of view, this second island contributes some entropy to the model; however, it also increases the overall energy of the model. This energy cost is particularly important at low temperature, and so this state either quickly reverts to a single large island, or another block departs from the island of size nine and joins the single-block island. The appearance of two modes is therefore attributed to the fact that, once one block detaches from a large island, other blocks can be expected to detach in order to stabilize this block. As temperature decreases further, the blocks become increasingly stable when they are all part of a single island, and the probability density becomes concentrated on islands of size 10. Figure 5c shows the alignment distribution estimated from the above simulations.
This corresponds to the expected fraction of P-, S- or T-type alignments observed in a randomly chosen equivalence class. At very low temperatures, almost all alignments are of the P-type. This quantifies the observation that large, chain-shaped islands tend to appear at low temperature, in agreement with the experimental observations described in figure 1. As temperature increases, the fraction of S-type alignments increases. This is a consequence of the entropic contribution to the model (described by the degeneracy term in equation (3.8)), which favours the appearance of islands of low symmetry.
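Distributions like those in figure 5b can be estimated by simple post-processing of the sampled equivalence classes. The sketch below assumes a minimal stand-in representation (each sampled class as a list of island sizes, with made-up sample values); it is not the paper's data structure.

```python
from collections import Counter

# Hypothetical MCMC sample of equivalence classes, each encoded as a
# list of island sizes. The values are illustrative only.
sample = [
    [10],         # all blocks gathered into a single island
    [9, 1],
    [10],
    [6, 3, 1],
    [10],
]

def island_size_distribution(sample):
    """Estimate P(k): the fraction of sampled islands containing k blocks."""
    counts = Counter()
    for classes in sample:
        for k in classes:
            counts[k] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}

dist = island_size_distribution(sample)
print(dist)  # for this toy sample, the mode sits at k = 10
```

The alignment distribution of figure 5c could be estimated the same way, by tallying P-, S- and T-type alignments instead of island sizes.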

Conclusion

Large state-space problems are rapidly becoming commonplace in most fields of science. For these problems, it is essential to have available methods that can efficiently extract 'target information' without burdening the researcher with volumes of superfluous information. This paper has presented a simple molecular self-assembly model that contains the basic elements of a large state-space problem, and developed the ECS technique for predicting a particular piece of target information (the 'kinds of islands' that appear with high probability in the model) with high precision. The key feature of ECS is that the model state space is reduced to a set of equivalence classes, where two states b1 and b2 belong to the same equivalence class if and only if b1 can be transformed into b2 under a suitably chosen map. By developing a special Markov chain on this reduced state space, we achieved efficient estimation of the target information in the model via MCMC sampling. While we have considered the special (but important) example of molecular self-assembly, we believe that the approach used in this study also provides a 'line of thinking' for other kinds of large state-space problems.

Two interesting mathematical problems are encountered in the development of ECS. In order to generate proposals for the MCMC sampling component of ECS (assuming that the MH algorithm is employed), one needs a Markov chain with state space H and calculable transition probabilities, where H is the collection of equivalence classes (reduced state space). Moreover, good estimates of the quantities n(E), the number of states contained in equivalence class E, are needed to ensure that the MCMC algorithm samples from a distribution that is sufficiently close to the actual target distribution of the model. In this paper, we presented the extension–reduction chain as one example of such a Markov chain.
However, there is motivation to develop other Markov chains and to study the convergence properties of the algorithms in detail. In particular, the types of Markov chains available for ECS are expected to depend upon the specific problem under consideration. We also presented an approximation for n(E) in the case of low coverage of molecules on the surface, and showed that this approximation converges as desired in the limit of zero surface coverage. While many experimental studies of molecular self-assembly use low-coverage conditions, it may also be interesting to develop other approximations to the degeneracies n(E) that hold in the high-coverage limit. While not so relevant for the generation of small molecular islands, experiments in the high-coverage limit are widely used to generate so-called two-dimensional crystals via molecular self-assembly [24]. The high-coverage regime is also of academic interest as it often involves relatively unintuitive self-assembly behaviour [9]. Another open problem for implementing ECS algorithms concerns the development of efficient algorithms for detecting rotational isomorphism between islands and configurations of self-assembly models. This computation is a bottleneck in our current implementation of ECS. Perhaps the greatest merit of studying molecular self-assembly via ECS is that the routes to improvement are very clear. We will explore them in detail in future research.
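The sampling step at the heart of ECS can be illustrated with a toy Metropolis–Hastings chain on a reduced state space, targeting a weight proportional to n(E) exp(−U(E)/kT). Everything concrete below is an illustrative assumption, not the paper's model: the encoding of equivalence classes as integer partitions of 10 blocks (island sizes), the stand-in energy U(E), the stand-in degeneracy n(E), and the uniform proposal.

```python
import math
import random

def partitions(n, max_part=None):
    """Yield the integer partitions of n as sorted tuples of island sizes."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest

def energy(E):
    # Stand-in U(E): a chain-shaped island of k blocks has k - 1 bonds,
    # each contributing energy -1.
    return -sum(k - 1 for k in E)

def degeneracy(E):
    # Stand-in n(E): more islands -> more distinct placements on the surface.
    return 10.0 ** len(E)

def weight(E, kT):
    # Unnormalized target over equivalence classes: n(E) * exp(-U(E)/kT).
    return degeneracy(E) * math.exp(-energy(E) / kT)

def mh_sample(H, kT, steps=50000, seed=0):
    """Metropolis-Hastings on the reduced space H with a uniform proposal."""
    rng = random.Random(seed)
    state = H[0]
    counts = {E: 0 for E in H}
    for _ in range(steps):
        proposal = rng.choice(H)  # symmetric proposal, so the MH ratio
        if rng.random() < weight(proposal, kT) / weight(state, kT):
            state = proposal      # is just the ratio of target weights
        counts[state] += 1
    return counts

H = list(partitions(10))      # 42 equivalence classes of 10 blocks
counts = mh_sample(H, kT=0.3)
most = max(counts, key=counts.get)
print(most)  # at this low kT, the single-island class is sampled most often
```

With these stand-in weights, the class (10,) (all blocks in one island) is the most frequently visited state at low kT, mirroring the sharp mode at k = 10 discussed above.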
Table 2.

The nine possible compositions for the extension–reduction transformation. The compositions are referred to as 'classes'. Ø denotes the empty set.

class 1  {−1, −1, +1, +1}        class 6  {−1, +1, +1}
class 2  {−1, −1, +2}            class 7  {−1, +2}
class 3  {−1, −1, +1}            class 8  {−1, +1}
class 4  {−2, +1, +1}            class 9  Ø
class 5  {−2, +1}
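The nine classes above are distinguished only by the multiset of island-size changes they produce. A small helper (the sorted-tuple encoding is our own convention, for illustration) maps a composition's signature to its class number:

```python
# Table 2 as a lookup: each key is the sorted multiset of island-size
# changes produced by a composition of the extension-reduction
# transformation; the value is its class number. The empty tuple is
# the empty composition (class 9, denoted by the empty set).
CLASSES = {
    (-1, -1, 1, 1): 1,
    (-1, -1, 2): 2,
    (-1, -1, 1): 3,
    (-2, 1, 1): 4,
    (-2, 1): 5,
    (-1, 1, 1): 6,
    (-1, 2): 7,
    (-1, 1): 8,
    (): 9,
}

def classify(changes):
    """Return the Table 2 class number for a list of island-size changes."""
    return CLASSES[tuple(sorted(changes))]

print(classify([-1, +2]))  # class 7
print(classify([]))        # class 9, the empty composition
```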
Table 3.

Analysis of the M = 0 cases (see proof of the theorem in section A.5).

condition                    reaction representation    realizable
A ≠ B, C ≠ D                 A → C, B → D               yes
A ≠ B, C = D                 A → C, B → C               yes
A ≠ B, C ≠ D = Ø             A → C, B → Ø               yes
A = B, C ≠ D                 A → C, A → D               yes
A = B, C = D                 A → C, A → C               no^a
A = B, C ≠ D = Ø             A → C, A → Ø               yes
Ø = A ≠ B, C ≠ D             Ø → C, B → D               yes
Ø = A ≠ B, C = D             Ø → C, B → C               yes
Ø = A ≠ B, C ≠ D = Ø         Ø → C, B → Ø               no^b

^a C cannot be in both the extension set and the reduction set of A.

^b Ø = A and D = Ø imply that both B and C contain only one block, and hence that B and C are rotational isomorphs. But this violates the condition B ≠ C.

Table 4.

Analysis of the M = 1 cases (see proof of the theorem in section A.5). We consider the two subcases C = B and D = A. For the case C = B, all cases with D = B or A = B are not realizable, as the former implies that B is reduced to B, and the latter implies that B is extended to B. For similar reasons, the cases C = A and B = A are not realizable when D = A.

condition                           reaction representation    realizable
C = B, A ≠ B, C ≠ D                 A → B, B → D               yes
C = B, A ≠ B, C ≠ D = Ø             A → B, B → Ø               no^a
C = B, Ø = A ≠ B, C ≠ D             Ø → B, B → D               no^b
C = B, Ø = A ≠ B, C ≠ D = Ø         Ø → B, B → Ø               yes
D = A, A ≠ B, C ≠ D                 A → C, B → A               yes
D = A, A ≠ B, C ≠ D = Ø             A → C, B → Ø               no^c
D = A, Ø = A ≠ B, C ≠ D             Ø → C, B → Ø               no^d
D = A, Ø = A ≠ B, C ≠ D = Ø         Ø → C, B → Ø               no^d

^a If B is reduced to the empty island, then B must contain one block only, which incorrectly implies that A = Ø.

^b Extension of the empty island to B implies that B contains one block only. Reduction of B to D then incorrectly implies that D is an empty island.

^c D = A and D = Ø incorrectly imply that A = Ø.

^d Reduction of B to the empty island, and extension of the empty island to C, imply that both B and C contain a single block, and hence that B and C are rotational isomorphs. But this violates the condition B ≠ C.

Table 5.

Analysis of the M = 2 case (see proof of the theorem in section A.5).

condition    reaction representation    realizable
A ≠ B        A → B, B → A               yes
Table 6.

Transition probability for every possible composition of the extension–reduction transformation. Note that both reactions in class 8 contribute the same probability to the extension–reduction transformation.

class 1 {−1, −1, +1, +1}
  A → C, B → D:
  q_kj = n_A n_B ( q^{kj}_{BA,DC} + q^{kj}_{AB,DC} + q^{kj}_{BA,CD} + q^{kj}_{AB,CD} )

class 2 {−1, −1, +2}
  A → C, B → C:
  q_kj = n_A n_B ( q^{kj}_{BA,CC} + q^{kj}_{AB,CC} )

class 3 {−1, −1, +1}
  A → C, B → Ø:
  q_kj = n_A n_B ( q^{kj}_{BA,C} + q^{kj}_{AB,C} )

class 4 {−2, +1, +1}
  A → C, A → D:
  q_kj = n_A (n_A − 1) ( q^{kj}_{AA,CD} + q^{kj}_{AA,DC} )

class 5 {−2, +1}
  A → C, A → Ø:
  q_kj = n_A (n_A − 1) q^{kj}_{AA,C}

class 6 {−1, +1, +1}
  Ø → C, B → D:
  q_kj = n_B ( q^{kj}_{B,DC} + q^{kj}_{B,CD} )

class 7 {−1, +2}
  Ø → C, B → C:
  q_kj = n_B q^{kj}_{B,CC}

class 8 {−1, +1}
  A → B, B → D  and  A → C, B → A:
  q_kj = Σ_{B ∈ R_kj, B ≠ A} n_A n_B ( q^{kj}_{BA,DB} + q^{kj}_{AB,BD} )

class 9 {Ø}
  B → B  and  A → B, B → A:
  q_kj = Σ_{B ∈ R_kj} n_B q^{kj}_{B,B} + Σ_{A,B ∈ R_kj : A ≠ B} n_A n_B ( q^{kj}_{AB,BA} + q^{kj}_{BA,BA} )

(11 reactions in total)

References

1.  Self-assembly at all scales.

Authors:  George M Whitesides; Bartosz Grzybowski
Journal:  Science       Date:  2002-03-29       Impact factor: 47.728

2.  Theory of self-assembly of microtubules and motors.

Authors:  Igor S Aranson; Lev S Tsimring
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2006-09-26

3.  Communication: theoretical prediction of free-energy landscapes for complex self-assembly.

Authors:  William M Jacobs; Aleks Reinhardt; Daan Frenkel
Journal:  J Chem Phys       Date:  2015-01-14       Impact factor: 3.488

4.  Insights into carbon nanotube and graphene formation mechanisms from molecular simulations: a review.

Authors:  A J Page; F Ding; S Irle; K Morokuma
Journal:  Rep Prog Phys       Date:  2015-03-09

5.  Surface-induced optimal packing of two-dimensional molecular networks.

Authors:  Guillaume Copie; Fabrizio Cleri; Younes Makoudi; Christophe Krzeminski; Maxime Berthe; Frédéric Cherioux; Frank Palmino; Bruno Grandidier
Journal:  Phys Rev Lett       Date:  2015-02-13       Impact factor: 9.161

6.  Simulation of graphene nanoribbon aggregation and its mediation by edge decoration.

Authors:  Jonathan D Saathoff; Paulette Clancy
Journal:  J Phys Chem B       Date:  2015-02-16       Impact factor: 2.991

7.  Periodic ordering of clusters and stripes in a two-dimensional lattice model. I. Ground state, mean-field phase diagram and structure of the disordered phases.

Authors:  J Pȩkalski; A Ciach; N G Almarza
Journal:  J Chem Phys       Date:  2014-03-21       Impact factor: 3.488

8.  Using Markov state models to study self-assembly.

Authors:  Matthew R Perkett; Michael F Hagan
Journal:  J Chem Phys       Date:  2014-06-07       Impact factor: 3.488

9.  Atomically precise bottom-up fabrication of graphene nanoribbons.

Authors:  Jinming Cai; Pascal Ruffieux; Rached Jaafar; Marco Bieri; Thomas Braun; Stephan Blankenburg; Matthias Muoth; Ari P Seitsonen; Moussa Saleh; Xinliang Feng; Klaus Müllen; Roman Fasel
Journal:  Nature       Date:  2010-07-22       Impact factor: 49.962

10.  Algorithmic self-assembly of DNA Sierpinski triangles.

Authors:  Paul W K Rothemund; Nick Papadakis; Erik Winfree
Journal:  PLoS Biol       Date:  2004-12-07       Impact factor: 8.029

Cited by (2 in total)

1.  Chemical and entropic control on the molecular self-assembly process.

Authors:  Daniel M Packwood; Patrick Han; Taro Hitosugi
Journal:  Nat Commun       Date:  2017-02-14       Impact factor: 14.919

2.  Materials informatics for self-assembly of functionalized organic precursors on metal surfaces.

Authors:  Daniel M Packwood; Taro Hitosugi
Journal:  Nat Commun       Date:  2018-06-25       Impact factor: 14.919

