Xuanxuan Li (1), Chufeng Li (2), Haiguang Liu (3)
1. Department of Engineering Physics, Tsinghua University, Beijing 100084, China.
2. Department of Physics, Arizona State University, Tempe, AZ 85287, USA.
3. Complex Systems Division, Beijing Computational Science Research Center, ZPark II, Haidian, Beijing 100193, China.
Abstract
Recent developments of two-color operation modes at X-ray free-electron laser facilities provide new research opportunities, such as X-ray pump/X-ray probe experiments and multiple-wavelength anomalous dispersion phasing methods. However, most existing indexing methods were developed for indexing diffraction data from monochromatic X-ray beams. Here, a new algorithm is presented for indexing two-color diffraction data, as an extension of the sparse-pattern indexing algorithm SPIND, which has been demonstrated to be capable of indexing diffraction patterns with as few as five peaks. The principle and implementation of the two-color indexing method, SPIND-TC, are reported in this paper. The algorithm was tested on both simulated and experimental data of protein crystals. The results show that the diffraction data can be accurately indexed in both cases. The source code is publicly available at https://github.com/lixx11/SPIND-TC.
Keywords:
indexing algorithm; serial crystallography; two-color diffraction
Over the past few years, serial femtosecond crystallography (SFX) has demonstrated the capabilities of determining three-dimensional macromolecular structures from microcrystals (Chapman et al., 2011; Boutet et al., 2012; Barends et al., 2014; Kupitz et al., 2014). Using femtosecond pulses of bright X-ray free-electron lasers (XFELs), diffraction signals are recorded from protein crystals at room temperature in the ‘diffraction-before-destruction’ approach (Solem, 1986; Neutze et al., 2000). This scheme avoids the structure alteration in the cryogenic cooling process (Fraser et al., 2011; Keedy et al., 2014), which is frequently adopted to protect protein crystals from radiation damage in macromolecular diffraction experiments at synchrotron facilities.
In contrast to the conventional macromolecular crystallography using synchrotron light sources, where only one or a few large crystals are required for a complete data set using the oscillation approach, SFX experiments usually require thousands to millions of microcrystals to yield a complete data set. Since every crystal sample is destroyed after being illuminated by XFEL pulses, one crystal only produces a single still diffraction pattern. Each diffraction pattern corresponds to a slice of the three-dimensional reciprocal space. Due to the femtosecond duration and the narrow bandwidth of XFEL pulses, only partial intensities of Bragg spots are recorded on each diffraction pattern. Moreover, the variation of crystal size, shape and shot-to-shot XFEL intensity adds more fluctuation to diffraction signals. To reconstruct a reciprocal space with full intensities, each reflection needs to be sampled many times to average out the noise due to these stochastic factors, which in turn requires a large volume of diffraction data.
To reduce the sample consumption and experiment time, several attempts have been made to improve the throughput.
At the Coherent X-ray Imaging (CXI) instrument (Liang et al., 2015) of the Linac Coherent Light Source (LCLS), researchers can refocus the transmitted beam that passes through the primary chamber, and conduct another independent experiment simultaneously (Boutet et al., 2015). This serial operation doubles the data collection efficiency but cannot reduce the sample consumption. The recent development of XFELs makes it possible to generate a pair of pulses with an adjustable separation of wavelength and time delay (Lutman et al., 2013; Hara et al., 2013). This two-color mode doubles the number of diffraction patterns collected from crystals, reducing both the beam time and the sample consumption. Gorel et al. applied the two-color approach in a multiple-wavelength anomalous dispersion experiment to determine the structure of lysozyme, and demonstrated that two-wavelength phases can be potentially more accurate than the single-wavelength case, since the second wavelength produces an additional independent measurement (Gorel et al., 2017b).
In SFX experiments, terabytes of diffraction data are collected and processed. Indexed patterns are merged to produce the intensity list for structure determination. At the first stage, the raw data are rapidly sorted and filtered by programs such as Cheetah (Barty et al., 2014), CASS (Foucar, 2016) or ClickX (Li et al., 2019b). The resulting diffraction images can be further indexed and merged using the CrystFEL suite (White et al., 2012). The program indexamajig of CrystFEL is integrated with several auto-indexers, such as MOSFLM (Powell, 1999), DirAx (Duisenberg, 1992) and XDS (Kabsch, 1988, 1993). Several new indexing algorithms have been developed recently. Brewster et al. developed a new indexing algorithm for sparse patterns, which showed good performance in indexing experimental patterns of peptide nanocrystals with small unit cells (Brewster et al., 2015). Based on inter-spot vectors, TakeTwo (Ginn et al., 2016) was shown to improve the indexing rate with the prior knowledge of unit-cell parameters for cubic, hexagonal and orthorhombic space groups. SPIND (Li et al., 2019a) is another prior-unit-cell-knowledge-based method, which searches for the best rotation solutions using lengths and angles between pairs of Bragg spots and the origin point of the reciprocal space. FELIX (Beyerlein et al., 2017) is able to index multiple crystals in serial crystallography patterns, and has been applied to simulated data sets of cubic, tetragonal and monoclinic crystals and experimental data sets from lysozyme microcrystals. The suite cctbx.xfel (Sauter et al., 2013; Hattne et al., 2014) represents an alternative set of SFX data processing programs, which can also index multiple crystals.
Here, we present an auto-indexing method for two-color diffraction patterns, SPIND-TC, as an extension of the sparse-data indexing method SPIND. This method has been tested on both simulated and experimental data sets, showing accurate indexing results. In particular, the indexing rate for an experimental data set is improved from 11.1% (as in the original work) to 50.9%.
Methods
SPIND-TC is developed based on the sparse-pattern indexing algorithm SPIND, which finds the optimal orientation of a crystal using the prior knowledge of the unit cell as a reference. The core idea is summarized as follows:
$$\hat{U} = \arg\max_{U} S(\mathbf{h}', \mathbf{h}), \qquad \mathbf{h}'_i = (UB)^{-1}\mathbf{q}_i. \tag{1}$$
In equation (1), S is the scoring function to evaluate the matching quality between the observed peaks and predicted peaks by comparing the fractional Miller index, which is denoted by $\mathbf{h}'_i$, and the corresponding nearest integer Miller index $\mathbf{h}_i$ for the ith peak. The goal is to find a rotation matrix U, such that the scoring function S (the number of matched peaks) is maximized. B is the orthogonalization matrix of the reference lattice, and $\mathbf{q}_i$ is the ith reciprocal vector. For any rotation matrix, Miller indices of all peaks are calculated. For the ith peak, if $\Delta\mathbf{h}_i = \mathbf{h}'_i - \mathbf{h}_i$, which is a 3-element tuple ($\Delta h_i$, $\Delta k_i$, $\Delta l_i$), is small enough, it is considered as a matched peak. The matching criterion is formulated as below:
$$\max(|\Delta h_i|, |\Delta k_i|, |\Delta l_i|) < \delta,$$
where δ is a user-specified parameter, and is set to 0.25 by default. In other words, if the largest deviation of Miller index is within 0.25, the observed peak is considered to match the predicted peak determined by U and B. The solution with most matched peaks, i.e. largest S, is considered as the best solution.
If the raw diffraction patterns (or the Miller indices derived from each diffraction pattern) were directly used to compare with the reference patterns, then many reference patterns would be required to sample all possible orientations. This approach is impractical due to the high computational demand. The SPIND algorithm first converts the Bragg vectors to a representation that is independent of orientation, using the two vector lengths and the angle between each pair of vectors to generate a reference table [see the SPIND paper (Li et al., 2019a) and Appendix A for details].
For two-color pattern indexing, the scoring function is modified to take the two different wavelengths into account. A peak matched to either color is regarded as a matched peak. The workflow is described in Fig. 1.
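The two-color scoring rule described above can be illustrated with a minimal NumPy sketch. It is not taken from the SPIND-TC source code: the function names, the convention q = UBh and the toy cubic lattice are assumptions made for illustration, and the reciprocal vectors are synthetic rather than derived from a detector.

```python
import numpy as np

def fractional_indices(q, U, B):
    """Fractional Miller indices h' = (U B)^-1 q for each row of q (N x 3).
    Assumes the convention q = U B h (an assumption of this sketch)."""
    return np.linalg.solve(U @ B, np.asarray(q, float).T).T

def matched(q, U, B, delta=0.25):
    """True where the largest deviation from the nearest integer index is below delta."""
    h_frac = fractional_indices(q, U, B)
    deviation = np.abs(h_frac - np.rint(h_frac))
    return deviation.max(axis=1) < delta

def two_color_score(q_color1, q_color2, U, B, delta=0.25):
    """Number of peaks matched under either color.
    q_color1 / q_color2 hold the reciprocal vectors of the same peaks,
    computed with the first and the second wavelength respectively."""
    either = matched(q_color1, U, B, delta) | matched(q_color2, U, B, delta)
    return int(either.sum())

# Toy example: cubic cell with a = 10 Å (hypothetical), rotated by 30° about z.
B = np.eye(3) / 10.0
t = np.radians(30.0)
U = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])

# Synthetic fractional indices: integer rows lie on the lattice, others do not.
h_color1 = np.array([[1, 0, 0], [0, 2, 1], [0.4, 0.4, 0.4], [0.4, 1.4, 2.4]])
h_color2 = np.array([[0.4, 0.4, 0.4], [0.6, 0.6, 0.6], [3, -1, 2], [0.6, 0.6, 0.6]])
q_color1 = (U @ B @ h_color1.T).T
q_color2 = (U @ B @ h_color2.T).T

score = two_color_score(q_color1, q_color2, U, B)
print(score)
```

In this toy case the first two peaks match under color 1, the third only under color 2, and the fourth under neither, so the score counts three matched peaks.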
Figure 1
Workflow of SPIND-TC. The indexing is performed for two colors independently, which is shown in two blue dashed boxes. The best rotation matrix is refined to obtain the final indexing solution and used to split the peaks into two color groups.
First, the peaks are detected on the two-dimensional diffraction images. The reciprocal vectors are calculated with given geometry parameters. The indexing (or searching) is conducted on the corresponding reciprocal vectors for two wavelengths independently.
The searching is a reference-matching process (dashed boxes in Fig. 1). The reference in SPIND/SPIND-TC is a pre-calculated table for the given space group and unit-cell parameters in the specified resolution range. Each Miller index pair is represented using three parameters, two vector lengths and the angle between the two vectors, hereafter denoted as a reference triple. The reference table is used to match peak pairs from diffraction patterns. The vector lengths and angles are used to narrow down the potential Bragg vectors to a small set, and then the rotation matrix is calculated using the other information in each reference to identify the orientation and assess the matching quality using the Miller indices.
For a pattern with N peaks, N(N − 1)/2 peak pairs can be generated and sorted by intensity, resolution or signal-to-noise ratio (SNR). Users can select the top k peak pairs for matching. The two lengths and one angle for each pair, denoted as an observed triple, are used to match the entries in the reference table within the given tolerance. A rotation matrix U can be calculated for each matched entry. Each rotation solution is evaluated by the scoring function. In monochromatic cases, a peak with a small Miller-index deviation $\Delta\mathbf{h}_i$ is considered as a matched peak. In two-color cases, a peak with either small $\Delta\mathbf{h}_i^{1}$ or small $\Delta\mathbf{h}_i^{2}$ (the superscripts 1 and 2 denote the two colors) is considered as a matched peak.
After the reference-based indexing, the peak list is divided into two groups, one for each color, according to the color probabilities. The probability of the ith peak resulting from color j is assigned as below:
Finally, a global refinement is performed to optimize the final solution for the following objective function using scipy.optimize (Jones et al., 2001–2020):
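As an illustration of the reference-matching step described earlier in this section (not of the refinement objective), the following sketch builds a small table of reference triples and matches one observed triple against it. It is a simplified assumption-laden example, not the SPIND-TC implementation: the helper names, the orthorhombic toy cell and the short Miller-index list are hypothetical, and the calculation of the rotation matrix from a matched entry is left out.

```python
import numpy as np
from itertools import permutations

def triple(q_a, q_b):
    """Two vector lengths and the angle between them (degrees)."""
    len_a, len_b = np.linalg.norm(q_a), np.linalg.norm(q_b)
    cos_ab = np.clip(np.dot(q_a, q_b) / (len_a * len_b), -1.0, 1.0)
    return len_a, len_b, np.degrees(np.arccos(cos_ab))

def build_reference_table(B, hkl_list):
    """One entry (h1, h2, |q1|, |q2|, angle) per ordered pair of Miller indices.
    Both orderings are stored so an observed pair can match in either order."""
    table = []
    for h1, h2 in permutations(hkl_list, 2):
        table.append((h1, h2) + triple(B @ np.array(h1), B @ np.array(h2)))
    return table

def match_triple(observed, table, len_tol=0.0025, ang_tol=1.0):
    """Reference pairs whose lengths and angle agree with the observed triple
    within the tolerances (lengths in reciprocal angstroms, angle in degrees)."""
    len_a, len_b, angle = observed
    return [(h1, h2) for h1, h2, ref_a, ref_b, ref_angle in table
            if abs(len_a - ref_a) < len_tol
            and abs(len_b - ref_b) < len_tol
            and abs(angle - ref_angle) < ang_tol]

# Toy orthorhombic cell (a, b, c = 10, 12, 15 Å; hypothetical values).
B = np.diag([1 / 10.0, 1 / 12.0, 1 / 15.0])
hkls = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 2, 1)]
table = build_reference_table(B, hkls)

# "Observed" pair: two reflections seen in an arbitrary, unknown orientation.
a, b = np.radians(25.0), np.radians(40.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
Rx = np.array([[1, 0, 0], [0, np.cos(b), -np.sin(b)], [0, np.sin(b), np.cos(b)]])
U = Rz @ Rx
observed = triple(U @ B @ np.array([1, 0, 0]), U @ B @ np.array([0, 2, 1]))

candidates = match_triple(observed, table)
print(candidates)
```

Because lengths and angles are invariant under rotation, the observed triple selects the correct Miller-index pair without any knowledge of U; each surviving candidate would then be used to compute and score a rotation.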
Results
Indexing simulated two-color diffraction data
To validate the ability of SPIND-TC to index two-color diffraction patterns, a simulated data set was generated from protein crystals [Protein Data Bank (PDB) code 5m2t, Prokofev et al., unpublished] at random orientations. The protein crystal has a P1 space group and a unit cell of a = 64.3, b = 72.0, c = 89.2 Å and α = 110.6, β = 107.5, γ = 85.8°. The diffraction patterns are simulated on a 1440 × 1440-pixel virtual detector with a pixel size of 100 × 100 µm. The detector is placed 0.1 m away from the sample downstream of the incident beam and perpendicular to the beam. Because the absolute intensity values are not used for indexing simulated data, a simplified model is used to calculate the diffraction patterns, where structure factors, excitation error and signal noise are not included. The lattice points are modeled as spheres with a fixed radius. A lattice point is considered as an excited spot if it is intercepted by the Ewald sphere for a photon energy of 7 or 9 keV, and the corresponding peak coordinate and the photon energy are registered. Each simulated diffraction pattern consists of multiple peaks from two photon energies on the two-dimensional detector [see Fig. 2(a) for an example]. All 100 simulated patterns were indexed successfully with correct orientation solutions (Fig. 2).
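For the geometry quoted above, the conversion from a detector pixel to a reciprocal vector for each photon energy can be sketched as follows. This is a minimal illustration under stated assumptions: the beam is taken along +z, the beam center is assumed to sit at the middle of the 1440 × 1440 detector, reciprocal vectors are expressed without the 2π factor, and the function name is hypothetical.

```python
import numpy as np

HC_KEV_ANGSTROM = 12.398419843  # h*c in keV·Å

def pixel_to_q(xy_pix, energy_kev, det_dist_m=0.1, pixel_size_m=100e-6,
               center_pix=(719.5, 719.5)):
    """Reciprocal vectors (Å^-1) for peaks at pixel coordinates xy_pix (N x 2).
    Assumes a flat detector perpendicular to a beam travelling along +z."""
    wavelength = HC_KEV_ANGSTROM / energy_kev            # Å
    xy_m = (np.asarray(xy_pix, float) - np.asarray(center_pix)) * pixel_size_m
    r = np.column_stack([xy_m, np.full(len(xy_m), det_dist_m)])
    s_out = r / np.linalg.norm(r, axis=1, keepdims=True)  # scattered direction
    s_in = np.array([0.0, 0.0, 1.0])                      # incident direction
    return (s_out - s_in) / wavelength

# The same detector position maps to different reciprocal vectors for each color.
peaks = [[1019.5, 1119.5], [719.5, 719.5]]
q_7kev = pixel_to_q(peaks, 7.0)
q_9kev = pixel_to_q(peaks, 9.0)
print(np.linalg.norm(q_7kev, axis=1), np.linalg.norm(q_9kev, axis=1))
```

Since both colors share the scattering direction for a given pixel, the two candidate reciprocal vectors differ only by the factor 9/7 in length, which is why each peak has to be tested against both wavelengths during indexing.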
Figure 2
Indexing results on the simulated data set. (a) A typical simulated two-color diffraction pattern using 7 keV (orange peaks) and 9 keV (blue peaks) XFELs in the orientation specified by Euler angles 10, 20, 30°. (b) Probabilities of 7 keV for all peaks in the simulated pattern (a). Probabilities larger than 0.5 correspond to the 7 keV peaks (orange). (c) Distribution of orientation errors for each Euler angle. Most of the angle errors are smaller than 0.2°.
Indexing experimental two-color diffraction data
To further investigate the performance of SPIND-TC on actual experimental data, we carried out an indexing test on data set ID 66 (Gorel et al., 2017a) in the Coherent X-ray Imaging Data Bank (CXIDB) (Maia, 2012). This data set was collected at SACLA (SPring-8 Angstrom Compact free-electron LAser) in 2016, and contains 208 373 diffraction patterns identified with Cheetah. In this experiment, 7 and 9 keV XFEL pulses were used to produce the diffraction images. Since no available program had been developed to index such two-color diffraction data by the time of the work, Gorel et al. used a two-round indexing approach, which utilizes the fact that the two-color images usually consist of two sets of diffraction peaks resulting from XFEL pulses with different fluences, so that the observed Bragg spots can be grouped by intensity. They first used a high-intensity threshold to detect strong peaks and tried indexing assuming 7 or 9 keV photon energy independently. After the first round of indexing, the diffraction images were reprocessed to extract peaks using a low threshold, resulting in more peaks including the ones with weaker signals. Peaks that can be indexed in the first round are masked out for the second round of indexing. The remaining peaks are indexed assuming the other photon energy that was not used for the first round. Using this detect–index–detect–index approach, which is referred to as the CrystFEL-TC approach in this article, 23 144 patterns were indexed for both colors.
However, the peak intensities are not only affected by the intensity of the XFEL pulses, but also by characteristics of crystal samples, including structure factors, partiality and mosaicity. The previous approach relies on the intensity-based peak grouping, resulting in low data efficiency. The SPIND-TC algorithm overcomes the dependency on correctly sorting the Bragg peaks based on intensity information. It searches the optimal orientation solution for two colors based on the location of peaks in a single round of data processing, and could improve the indexing rate significantly.
To be consistent with the workflow of Gorel et al., indexamajig was used to detect peaks with an intensity threshold of 150 and SNR of 3 to include both strong and weak Bragg peaks. The peak lists were then processed by SPIND-TC. The peaks that were successfully classified into two colors were saved to hdf5 files with the associated indexing solution. Finally, we used indexamajig to check all indexing solutions on the classified peaks, and wrote results to stream files that are compatible with CrystFEL. By following this workflow, 106 154 images were indexed successfully, about 3.6 times more than that using the previous approach (Fig. 3).
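The indexing rates and the improvement factor follow directly from the counts given in this section (208 373 hits, 23 144 patterns indexed with CrystFEL-TC and 106 154 with SPIND-TC). As a worked check, reading "3.6 times more" as the relative increase:

```latex
\begin{align*}
\text{CrystFEL-TC rate} &= \frac{23\,144}{208\,373} \approx 11.1\% \\
\text{SPIND-TC rate}    &= \frac{106\,154}{208\,373} \approx 50.9\% \\
\text{relative increase} &= \frac{106\,154 - 23\,144}{23\,144} \approx 3.6 \\
\text{ratio of indexed patterns} &= \frac{106\,154}{23\,144} \approx 4.6
\end{align*}
```

In other words, SPIND-TC indexed about 4.6 times as many patterns as the earlier approach, which corresponds to an increase of about 3.6 times the earlier count.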
Figure 3
(a) Indexing results with CrystFEL-TC and SPIND-TC on the two-color data set. Of about 208 000 hits, CrystFEL-TC indexed about 23 000 two-color patterns, while SPIND-TC indexed about 106 000 patterns, covering almost all of the patterns indexed by CrystFEL-TC (all but 34 patterns). (b) Representative patterns that can only be indexed with SPIND-TC.
To compare the merged data quality between SPIND-TC and CrystFEL-TC, we followed the instructions in the original paper to index the 9 keV data set with MOSFLM and DirAx, and obtained 30 663 indexed patterns. The indexing results of SPIND-TC and CrystFEL-TC were merged using partialator. Since SPIND-TC had more patterns indexed, the redundancy was improved significantly, as were $R_{\mathrm{split}}$ and CC*. A higher SNR indicates that SPIND-TC indexed more accurately than the CrystFEL-TC approach (Fig. 4).
Figure 4
Binned figures-of-merit (FOMs) comparison between SPIND-TC and CrystFEL-TC. SPIND-TC performs better than the CrystFEL-TC method in all FOMs, including redundancy, SNR, $R_{\mathrm{split}}$ and CC*.
Speed test
Although SPIND-TC is implemented in Python, the throughput of indexing on experimental protein data is reasonably high. A series of tests were conducted to evaluate the processing speed of SPIND-TC. We used 10 000 two-color images from the CXIDB 66 repository for all the speed tests. A reference table containing reflections below 5 Å was generated for indexing. The matching tolerances for vector lengths and angles were set to 0.0025 Å⁻¹ and 1°, respectively. All peaks were sorted by SNR values. The number of peak pairs selected for finding rotation solutions was varied from one to 100, and the corresponding indexing time ranged from 2.2 to 168.7 core-seconds per pattern, as shown in Fig. 5. The utility is defined as the percentage of indexed patterns out of all indexable patterns. In this case, the number of all indexable patterns was determined to be 6192. Users can select a proper number of peak pairs according to the available computation power. With limited computational resources, it is recommended to start with a small number of peak pairs, e.g. five, as the matching targets. It is found that a high utility can still be achieved with fast processing speed by using only five peak pairs from each pattern.
Figure 5
Performance of SPIND-TC at various matching pairs. The blue line shows the processing speed in core-seconds, while the green line represents the utility. If 100 peak pairs are searched in SPIND-TC, 170 core-seconds is required to index a single image, resulting in 6192 images indexed.
Discussion and conclusions
Two-color modes of XFEL provide new opportunities and also bring challenges for serial crystallography. In two-color experiments, the data collection rate is doubled since one image contains two diffraction patterns. For small energy separation, such as 1%, which is feasible at LCLS, it is anticipated that such two-color data can be indexed by monochromatic methods as well as SPIND-TC. This was confirmed by a simulation test with 9 and 9.1 keV energy, where all images could be successfully indexed by MOSFLM and SPIND-TC. The problem for such data is the peak integration for the overlapped spots in the low-resolution region, which requires accurate orientation refinement and careful intensity deconvolution. On the other hand, such data can be analyzed using the pink-beam diffraction method, which utilizes a broader energy bandwidth (up to 5%) (Meents et al., 2017).
In this work, we focused on the large energy separation cases (e.g. 7 and 9 keV). The two Ewald spheres sample different regions of the reciprocal space, so such patterns are difficult to index using algorithms developed for diffraction patterns from monochromatic X-rays. The indexing method SPIND-TC presented here fulfills the requirements for indexing two-color diffraction patterns. It does not depend on intensity sorting of Bragg spots, and has been tested on both simulated and experimental protein data. The indexing rate for experimental data was increased by approximately 3.6 times compared with the previously reported results. The source code is publicly available at https://github.com/lixx11/SPIND-TC.
Authors: A A Lutman; R Coffee; Y Ding; Z Huang; J Krzywinski; T Maxwell; M Messerschmidt; H-D Nuhn Journal: Phys Rev Lett Date: 2013-03-25 Impact factor: 9.161
Authors: Johan Hattne; Nathaniel Echols; Rosalie Tran; Jan Kern; Richard J Gildea; Aaron S Brewster; Roberto Alonso-Mori; Carina Glöckner; Julia Hellmich; Hartawan Laksmono; Raymond G Sierra; Benedikt Lassalle-Kaiser; Alyssa Lampe; Guangye Han; Sheraz Gul; Dörte DiFiore; Despina Milathianaki; Alan R Fry; Alan Miahnahri; William E White; Donald W Schafer; M Marvin Seibert; Jason E Koglin; Dimosthenis Sokaras; Tsu-Chien Weng; Jonas Sellberg; Matthew J Latimer; Pieter Glatzel; Petrus H Zwart; Ralf W Grosse-Kunstleve; Michael J Bogan; Marc Messerschmidt; Garth J Williams; Sébastien Boutet; Johannes Messinger; Athina Zouni; Junko Yano; Uwe Bergmann; Vittal K Yachandra; Paul D Adams; Nicholas K Sauter Journal: Nat Methods Date: 2014-03-16 Impact factor: 28.547
Authors: Christopher Kupitz; Shibom Basu; Ingo Grotjohann; Raimund Fromme; Nadia A Zatsepin; Kimberly N Rendek; Mark S Hunter; Robert L Shoeman; Thomas A White; Dingjie Wang; Daniel James; Jay-How Yang; Danielle E Cobb; Brenda Reeder; Raymond G Sierra; Haiguang Liu; Anton Barty; Andrew L Aquila; Daniel Deponte; Richard A Kirian; Sadia Bari; Jesse J Bergkamp; Kenneth R Beyerlein; Michael J Bogan; Carl Caleman; Tzu-Chiao Chao; Chelsie E Conrad; Katherine M Davis; Holger Fleckenstein; Lorenzo Galli; Stefan P Hau-Riege; Stephan Kassemeyer; Hartawan Laksmono; Mengning Liang; Lukas Lomb; Stefano Marchesini; Andrew V Martin; Marc Messerschmidt; Despina Milathianaki; Karol Nass; Alexandra Ros; Shatabdi Roy-Chowdhury; Kevin Schmidt; Marvin Seibert; Jan Steinbrener; Francesco Stellato; Lifen Yan; Chunhong Yoon; Thomas A Moore; Ana L Moore; Yulia Pushkar; Garth J Williams; Sébastien Boutet; R Bruce Doak; Uwe Weierstall; Matthias Frank; Henry N Chapman; John C H Spence; Petra Fromme Journal: Nature Date: 2014-07-09 Impact factor: 49.962
Authors: Sébastien Boutet; Lukas Lomb; Garth J Williams; Thomas R M Barends; Andrew Aquila; R Bruce Doak; Uwe Weierstall; Daniel P DePonte; Jan Steinbrener; Robert L Shoeman; Marc Messerschmidt; Anton Barty; Thomas A White; Stephan Kassemeyer; Richard A Kirian; M Marvin Seibert; Paul A Montanez; Chris Kenney; Ryan Herbst; Philip Hart; Jack Pines; Gunther Haller; Sol M Gruner; Hugh T Philipp; Mark W Tate; Marianne Hromalik; Lucas J Koerner; Niels van Bakel; John Morse; Wilfred Ghonsalves; David Arnlund; Michael J Bogan; Carl Caleman; Raimund Fromme; Christina Y Hampton; Mark S Hunter; Linda C Johansson; Gergely Katona; Christopher Kupitz; Mengning Liang; Andrew V Martin; Karol Nass; Lars Redecke; Francesco Stellato; Nicusor Timneanu; Dingjie Wang; Nadia A Zatsepin; Donald Schafer; James Defever; Richard Neutze; Petra Fromme; John C H Spence; Henry N Chapman; Ilme Schlichting Journal: Science Date: 2012-05-31 Impact factor: 47.728
Authors: Sébastien Boutet; Lutz Foucar; Thomas R M Barends; Sabine Botha; R Bruce Doak; Jason E Koglin; Marc Messerschmidt; Karol Nass; Ilme Schlichting; M Marvin Seibert; Robert L Shoeman; Garth J Williams Journal: J Synchrotron Radiat Date: 2015-04-11 Impact factor: 2.616