Yulong Zhuang1,2, Salah Awel3, Anton Barty3, Richard Bean4, Johan Bielecki4, Martin Bergemann4, Benedikt J Daurer5, Tomas Ekeberg6, Armando D Estillore3, Hans Fangohr1,2,4,7, Klaus Giewekemeyer4, Mark S Hunter8, Mikhail Karnevskiy4, Richard A Kirian9, Henry Kirkwood4, Yoonhee Kim4, Jayanath Koliyadu4, Holger Lange10,11, Romain Letrun4, Jannik Lübke3,10,12, Abhishek Mall1,2, Thomas Michelat4, Andrew J Morgan13, Nils Roth3,12, Amit K Samanta3, Tokushi Sato4, Zhou Shen5, Marcin Sikorski3, Florian Schulz10,14, John C H Spence9, Patrik Vagovic3,4, Tamme Wollweber1,2,10, Lena Worbs3,12, P Lourdu Xavier1,3,10, Oleksandr Yefanov3, Filipe R N C Maia6,15, Daniel A Horke3,10,16, Jochen Küpper3,10,12,17, N Duane Loh5,18, Adrian P Mancuso4,19, Henry N Chapman3,10,12, Kartik Ayyer1,2,10.
Abstract
One of the outstanding analytical problems in X-ray single-particle imaging (SPI) is the classification of structural heterogeneity, which is especially difficult given the low signal-to-noise ratios of individual patterns and the fact that even identical objects can yield patterns that vary greatly when orientation is taken into consideration. Proposed here are two methods which explicitly account for this orientation-induced variation and can robustly determine the structural landscape of a sample ensemble. The first, termed common-line principal component analysis (PCA), provides a rough classification which is essentially parameter free and can be run automatically on any SPI dataset. The second method, utilizing variational auto-encoders (VAEs), can generate 3D structures of the objects at any point in the structural landscape. Both these methods are implemented in combination with the noise-tolerant expand-maximize-compress (EMC) algorithm, and their utility is demonstrated by applying them to an experimental dataset from gold nanoparticles with only a few thousand photons per pattern. Both discrete structural classes and continuous deformations are recovered. These developments diverge from previous approaches of extracting reproducible subsets of patterns from a dataset and open up the possibility of moving beyond the study of homogeneous sample sets to addressing open questions on topics such as nanocrystal growth and dynamics, as well as phase transitions which have not been externally triggered. © Yulong Zhuang et al. 2022.
Keywords: XFELs; coherent X-ray diffractive imaging (CXDI); single particles
Year: 2022 PMID: 35371510 PMCID: PMC8895023 DOI: 10.1107/S2052252521012707
Source DB: PubMed Journal: IUCrJ ISSN: 2052-2525 Impact factor: 4.769
Figure 1. (a) Examples of diffraction patterns in the dataset. The colour scale maximizes at four photons. (b) The workflow of the CLPCA method. (c) The upper row shows some typical frame averages in the dataset on a logarithmic scale, while the lower row shows their corresponding real-space 2D density projections via phase retrieval. The real-space field of view is 132.4 nm.
Figure 2. (a) An illustration of the common line between two patterns of similar-shaped objects but with different orientations. (b) The distribution of 1000 averages in the 3D CLPCA space. Different colours are manually divided groups from the first two components: contaminants (blue), cubes (red), transition (yellow) and spheres (green). Typical patterns for averages in each group are also shown on a logarithmic scale.
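The common-line idea in panel (a) can be illustrated with a toy correlation score: two patterns of similarly shaped objects share at least one matching radial line, so the best match over all angle pairs is informative about shape similarity. The sketch below is illustrative only (the function name and normalization are assumptions, not the paper's exact metric):

```python
import numpy as np

def common_line_score(polar_a, polar_b):
    """Similarity of two diffraction patterns via their best-matching
    radial lines (a toy 'common line' search).

    polar_a, polar_b: (n_radial, n_angular) polar-resampled intensities.
    Returns the maximum normalized correlation over all angle pairs.
    """
    # Normalize each angular line (a radial profile) to unit norm
    a = polar_a / (np.linalg.norm(polar_a, axis=0, keepdims=True) + 1e-12)
    b = polar_b / (np.linalg.norm(polar_b, axis=0, keepdims=True) + 1e-12)
    # Correlation between every radial line of A and every radial line of B
    corr = a.T @ b                      # shape (n_angular, n_angular)
    return corr.max()
```

By construction the score is invariant to an in-plane rotation of either pattern (a rotation only permutes the angular lines), which is the property that makes common-line comparisons orientation tolerant.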
Figure 3. A brief illustration of the absolute embedding neural network (NN) model. (a) The average pattern from Dragonfly. (b) A polar representation of the pattern. (c) A stack of 1D Fourier transform magnitudes along the angular axis for each radial bin. The odd frequency components (due to inversion symmetry) and the higher frequencies for signals at smaller radii have been removed. These represent the feature vectors for the neural network. (d) An example of using absolute CLPCA on a selected cubic subset of frames (red dots). The grey dots represent the embedding of the pattern averages from the whole dataset. (e) Training labels from CLPCA.
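The feature construction described in panels (b) and (c) can be sketched roughly as follows, assuming the polar resampling has already been done. The helper name and the per-radius order cutoffs `max_orders` are illustrative parameters, not the paper's exact implementation:

```python
import numpy as np

def angular_fft_features(polar, max_orders):
    """Build a rotation-insensitive feature vector from a polar-resampled
    diffraction pattern.

    polar: (n_radial, n_angular) array of intensities on a polar grid.
    max_orders: per-radial-bin cutoff on the number of angular Fourier
                orders kept (fewer at small radii, where fine angular
                structure is not resolvable).
    """
    feats = []
    for r, cut in enumerate(max_orders):
        f = np.abs(np.fft.rfft(polar[r]))   # FT magnitude along the angular axis
        f = f[0:2 * cut:2]                  # keep even orders only: Friedel
                                            # (inversion) symmetry makes odd
                                            # angular orders vanish
        feats.append(f)
    return np.concatenate(feats)
```

Using Fourier magnitudes along the angular axis discards the phase, and with it the in-plane rotation of the pattern, so the resulting feature vector depends only on the shape content of each radial bin.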
Figure 4. (a) A schematic diagram of the variational auto-encoder (VAE). It consists of two convolutional neural networks as encoder and decoder, and the model is fitted by comparing the similarity between the input and decoder-generated 2D slices with additional regularization by assuming the latent parameter follows an N(0, 1) distribution. (b) The distribution of the 10 000 bootstrapped 2D average patterns in CLPCA space (grey). Red dots are selected average patterns along the melting sequence used for the VAE analysis. (c) CLPCA-2 plotted against gain for the selected patterns from panel (b) along the melting sequence.
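The two generic VAE ingredients named in panel (a), a reparameterized latent sample and the regularizer pulling the latent distribution toward N(0, 1), can be sketched numerically as below. The function names are illustrative, and the paper's convolutional encoder/decoder networks are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, 1): sampling written so that
    # gradients can flow through mu and log_var during training
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims;
    # this is the regularization term added to the reconstruction loss
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```

The KL term is zero exactly when the encoder outputs mu = 0 and log_var = 0, i.e. when the latent posterior already matches the N(0, 1) prior.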
Figure 5. (a) VAE-encoded 1D latent number z plotted against CLPCA-2 for the input sample patterns. The colour coding is the gain parameter of each pattern. (b) A histogram of latent number z. (c) A histogram of CLPCA-2. (d) The volume ‘evolution’ along z, with volumes calculated using a density threshold equal to 10⁻⁴ of the total mass. The dashed horizontal line is the size of the support volume in phase retrieval. Vertical grey lines show the locations of the 12 selected z numbers. (e) The top three rows are the VAE-generated intensity volumes from the 12 selected z numbers on a logarithmic scale, showing slices along the 〈100〉, 〈110〉 and 〈111〉 directions, respectively (rows 1 to 3). The bottom three rows show their corresponding density projections in real space.
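The threshold-based volume measurement in panel (d) could be computed along these lines. This is a hypothetical helper under the stated definition (threshold = 10⁻⁴ of the total mass); the paper's exact voxel bookkeeping may differ:

```python
import numpy as np

def support_volume(density, frac=1e-4, voxel_volume=1.0):
    """Volume of a reconstructed object, measured as the number of voxels
    whose density exceeds a fixed fraction of the total mass."""
    threshold = frac * density.sum()
    return np.count_nonzero(density > threshold) * voxel_volume
```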
Figure 6. (a) The distribution of the dataset in 2D latent space, colour coded by CLPCA-2. Black crosses in the plot are the 200 selected latent numbers z. (b) The 〈100〉 direction projections of the 200 real-space densities retrieved from the logarithmic intensity volumes generated from the 200 selected latent numbers z [black crosses in panel (a)]. Semi-transparent slices are volumes generated from latent regions with no or very few data points.