Keitaro Yamashita, Kunio Hirata, Masaki Yamamoto.
Abstract
In protein microcrystallography, radiation damage often hampers complete and high-resolution data collection from a single crystal, even under cryogenic conditions. One promising solution is to collect small wedges of data (5–10°) separately from multiple crystals. The data from these crystals can then be merged into a complete reflection-intensity set. However, data processing of multiple small-wedge data sets is challenging. Here, a new open-source data-processing pipeline, KAMO, which utilizes existing programs, including the XDS and CCP4 packages, has been developed to automate whole data-processing tasks in the case of multiple small-wedge data sets. Firstly, KAMO processes individual data sets and collates those indexed with equivalent unit-cell parameters. The space group is then chosen and any indexing ambiguity is resolved. Finally, clustering is performed, followed by merging with outlier rejections, and a report is subsequently created. Using synthetic and several real-world data sets collected from hundreds of crystals, it was demonstrated that merged structure-factor amplitudes can be obtained in a largely automated manner using KAMO, which greatly facilitated the structure analyses of challenging targets that only produced microcrystals.
Keywords: KAMO; automatic data processing; microcrystals; small-wedge data sets
Year: 2018 PMID: 29717715 PMCID: PMC5930351 DOI: 10.1107/S2059798318004576
Source DB: PubMed Journal: Acta Crystallogr D Struct Biol ISSN: 2059-7983 Impact factor: 7.652
Figure 1. The workflow of KAMO for multiple small-wedge data sets with the program name for each role provided. The external programs that are required are also shown.
Figure 2. The KAMO GUI for processing individual wedges and for initiating the merging procedure. The collected data sets are listed with data-collection parameters and data-processing summaries. In the lower panel, the processing details for a selected item can be seen, including log files and plots with respect to image numbers.
Figure 3. Grouping of indexed results based on unit-cell parameters in preparation for merging (titin data; simulated 1g1c). In this case all 100 data sets were indexed with a consistent unit cell. The averaged cell in P1 is shown and the possible point-group symmetries are listed with the ‘frequency’ of how many times POINTLESS assigned the symmetry.
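The grouping of data sets by unit-cell parameters can be illustrated with a minimal Python sketch. This is not KAMO's implementation: the function names `cells_similar` and `group_by_cell`, the tolerance values and the example cells are hypothetical, and the sketch ignores reindexing between equivalent cell settings, which a real pipeline would need to handle.

```python
from itertools import combinations


def cells_similar(c1, c2, length_tol=0.03, angle_tol=2.0):
    """Return True if two cells (a, b, c, alpha, beta, gamma) agree within tolerance.

    length_tol is a relative tolerance on edge lengths and angle_tol is an
    absolute tolerance in degrees. Both defaults are illustrative assumptions.
    """
    lengths_ok = all(
        abs(x - y) / max(x, y) <= length_tol for x, y in zip(c1[:3], c2[:3])
    )
    angles_ok = all(abs(x - y) <= angle_tol for x, y in zip(c1[3:], c2[3:]))
    return lengths_ok and angles_ok


def group_by_cell(cells, **tol):
    """Single-linkage grouping; returns lists of data-set indices, largest group first."""
    parent = list(range(len(cells)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cells)), 2):
        if cells_similar(cells[i], cells[j], **tol):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(cells)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


# Invented cells for illustration only (not taken from the titin data).
example_cells = [
    (38.3, 78.6, 79.6, 90.0, 90.0, 90.0),
    (38.4, 78.5, 79.7, 90.0, 90.0, 90.0),
    (50.0, 60.0, 70.0, 90.0, 100.0, 90.0),
]
example_groups = group_by_cell(example_cells)
```

Single linkage is used here only because it is short to write; it can chain dissimilar cells together through intermediates, so a production tool may prefer a stricter criterion.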
Current structure analyses that KAMO has contributed to
| Sample | Polyhedra | Polyhedra | LPA6 | ETBR | TPT | TPT | AtDTX14 | OX2R |
|---|---|---|---|---|---|---|---|---|
| PDB code | | | | | | | | |
| Resolution (Å) | 1.68 | 1.55 | 3.2 | 3.5 | 2.1 | 2.2 | 2.6 | 1.96 |
| Space group | | | | | | | | |
| Degrees per data set | 5 | 5 | 4, 6 | 10 | 3–10, 30 | 3–10 | 5, 10, 20 | 1–6 |
| No. of data sets collected | 20 | 184 | 397 | 16 | 723 | 332 | 373 | 805 |
| No. of data sets processed for merging | 20 | 155 | 350 | 16 | 599 | 250 | 139 | 768 |
| No. of merged data sets | 14 | 41 | 241 | 14 | 319 | 199 | 100 | 631 |
Cypovirus polyhedra (Abe et al., 2017).
Lysophosphatidic acid receptor LPA6 (Taniguchi et al., 2017).
Endothelin ETB receptor bound to bosentan (Shihoya et al., 2017).
Triose-phosphate/phosphate translocator bound to Pi and 3-PGA (Lee et al., 2017).
Eukaryotic MATE transporter AtDTX14 (Miyauchi et al., 2017).
Human orexin 2 receptor (Suno et al., 2017).
Figure 4. Breaking indexing ambiguity in the titin data (simulated 1g1c). The averaged CC with all other wedges in the final (converged) cycle is plotted for two possible modes (h, k, l and −h, l, k). The resolution was validated using the original 1g1c data (shown as symbols and colours). This figure was prepared using ggplot2 (Wickham, 2009) in R (R Development Core Team, 2008).
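The idea of choosing between two indexing modes by the averaged CC against other wedges can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used by KAMO: each wedge is represented as a hypothetical dictionary mapping (h, k, l) to intensity, symmetry-equivalent reflections are not reduced to an asymmetric unit, and only a single pass against a fixed set of wedges is shown rather than an iteration to convergence. All function names are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def reindex(wedge):
    """Apply the alternative indexing mode (h, k, l) -> (-h, l, k)."""
    return {(-h, l, k): inten for (h, k, l), inten in wedge.items()}


def mean_cc(wedge, others):
    """Average CC of one wedge against each other wedge over common reflections."""
    ccs = []
    for other in others:
        common = sorted(set(wedge) & set(other))
        cc = pearson([wedge[hkl] for hkl in common], [other[hkl] for hkl in common])
        if cc is not None:
            ccs.append(cc)
    return sum(ccs) / len(ccs) if ccs else None


def choose_mode(wedge, others):
    """Return the mode label with the higher averaged CC, together with that CC."""
    cc_orig = mean_cc(wedge, others)
    cc_alt = mean_cc(reindex(wedge), others)
    if cc_alt is not None and (cc_orig is None or cc_alt > cc_orig):
        return "-h,l,k", cc_alt
    return "h,k,l", cc_orig


# Toy intensities invented for this example; they are deliberately not
# invariant under the alternative mode, so the two modes can be told apart.
reference = {
    (h, k, l): 100 + 10 * h + 3 * k + l
    for h in range(-2, 3)
    for k in range(4)
    for l in range(4)
}
flipped = reindex(reference)  # the same data indexed in the other mode
mode, cc = choose_mode(flipped, [reference])
```

In practice the averaged CC is only informative when wedges share enough reflections, so a minimum number of common reflections would normally be required before a pairwise CC is counted.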
Figure 5. Improved anomalous data quality by increasing the scaling batches in the titin data (simulated 1g1c). To increase the number of batches the option xscale.degrees_per_batch=1 was given in kamo.multi_merge. (a) CCano is the correlation coefficient of I(+) − I(−) between random half-sets. (b) Anomalous difference Fourier peak heights calculated using the observed anomalous differences and the 1g1c model with SHELXC and ANODE (Thorn & Sheldrick, 2011).
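The definition of CCano as the correlation of I(+) − I(−) between random half-sets can be made concrete with a short Python sketch. It is an illustration only and is not how XSCALE or KAMO computes the statistic: the input layout (a dictionary mapping each unique reflection to two lists of observations), the function names and the rule of skipping reflections with fewer than two observations of either Friedel mate are all assumptions made for the example.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def half_set_deltas(plus, minus, rng):
    """Split the I(+) and I(-) observations of one reflection into two random
    halves and return the anomalous difference computed from each half."""
    if len(plus) < 2 or len(minus) < 2:
        return None  # not enough observations to fill both halves
    plus, minus = list(plus), list(minus)
    rng.shuffle(plus)
    rng.shuffle(minus)
    hp, hm = len(plus) // 2, len(minus) // 2

    def mean(values):
        return sum(values) / len(values)

    d1 = mean(plus[:hp]) - mean(minus[:hm])
    d2 = mean(plus[hp:]) - mean(minus[hm:])
    return d1, d2


def cc_ano(reflections, seed=0):
    """Correlate half-set anomalous differences across all usable reflections."""
    rng = random.Random(seed)
    first, second = [], []
    for plus, minus in reflections.values():
        deltas = half_set_deltas(plus, minus, rng)
        if deltas is not None:
            first.append(deltas[0])
            second.append(deltas[1])
    return pearson(first, second)
```

With noise-free observations both halves give the same difference for every reflection and the sketch returns a CC of 1; measurement noise lowers the value, which is why higher multiplicity tends to raise CCano.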
Data and SAD phase quality for clusters by unit-cell similarities using BLEND for mercury-bound M2R data (PDB entry 5yc8)
The top eight results sorted by multiplicity are shown. LCV is the linear cell variation defined in BLEND (Foadi et al., 2013). The inner and outer resolution ranges are 50–7.50 and 2.65–2.50 Å, respectively. The correctness of the located Hg sites was evaluated against those of the 5yc8 model using phenix.emma (Adams et al., 2002). B Wilson is the Wilson B value reported by CTRUNCATE. Completeness is greater than or equal to 98% for every case. For the calculation of multiplicity, CC1/2 and CCano, Friedel pairs are treated as different reflections.
| LCV (%) | Cluster height | No. of data sets merged | Multiplicity | CC1/2 (outer) | CCano (inner) | Hg correctly located: No. | Hg correctly located: R.m.s.d. (Å) | CCmap | B Wilson (Å²) |
|---|---|---|---|---|---|---|---|---|---|
| 19.6 | 88.56 | 454 | 20.7 | 0.611 | 0.80 | 3 | 0.60 | 0.575 | 33.8 |
| 5.7 | 71.17 | 421 | 19.2 | 0.606 | 0.79 | 2 | 0.38 | 0.529 | 33.8 |
| 4.6 | 56.42 | 388 | 17.7 | 0.582 | 0.76 | 3 | 1.16 | 0.506 | 33.3 |
| 4.6 | 48.77 | 283 | 13.0 | 0.460 | 0.67 | 3 | 1.27 | 0.505 | 34.1 |
| 4.0 | 42.69 | 215 | 10.0 | 0.379 | 0.63 | 3 | 0.75 | 0.415 | 33.7 |
| 4.0 | 28.15 | 184 | 8.6 | 0.349 | 0.59 | 2 | 0.67 | 0.317 | 33.7 |
| 2.1 | 15.92 | 135 | 6.3 | 0.264 | 0.54 | 2 | 0.41 | 0.109 | 33.2 |
| 1.5 | 18.51 | 103 | 4.7 | 0.109 | 0.50 | 2 | 0.66 | 0.076 | 29.9 |
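To give a feel for what a linear cell variation measures, the following Python sketch computes the largest relative difference between corresponding unit-cell face diagonals over all pairs of data sets, expressed as a percentage. This is one reading of the LCV idea and is an assumption: the exact formula and choice of diagonals used by BLEND should be taken from Foadi et al. (2013), and the function names here are hypothetical.

```python
from itertools import combinations
from math import cos, radians, sqrt


def face_diagonals(cell):
    """Return one diagonal on each of the ab, ac and bc faces of a cell
    given as (a, b, c, alpha, beta, gamma) with angles in degrees."""
    a, b, c, alpha, beta, gamma = cell

    def diagonal(x, y, angle_deg):
        # Length of the vector sum of two edges separated by angle_deg.
        return sqrt(x * x + y * y + 2.0 * x * y * cos(radians(angle_deg)))

    return (diagonal(a, b, gamma), diagonal(a, c, beta), diagonal(b, c, alpha))


def linear_cell_variation(cells):
    """Largest relative face-diagonal difference over all cell pairs, in percent."""
    worst = 0.0
    for c1, c2 in combinations(cells, 2):
        for d1, d2 in zip(face_diagonals(c1), face_diagonals(c2)):
            worst = max(worst, abs(d1 - d2) / min(d1, d2))
    return 100.0 * worst


# Invented cells for illustration: a 10% change in c changes the ac and bc
# face diagonals by roughly 5%.
example_lcv = linear_cell_variation(
    [(10.0, 10.0, 10.0, 90.0, 90.0, 90.0), (10.0, 10.0, 11.0, 90.0, 90.0, 90.0)]
)
```

Because the value is a maximum over all pairs, adding data sets to a cluster can only keep it the same or raise it.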
Comparison of clustering methods using polyhedra data (PDB entry 5gqn)
The top 12 results sorted by R free are shown out of 99 clusters tested. The inner and outer resolution ranges are 50–4.65 and 1.65–1.55 Å, respectively. 〈B〉 is the averaged atomic B value reported by phenix.refine. Completeness is greater than or equal to 99.8% for every case.
| Clustering method | LCV (%) | No. of data sets merged | Multiplicity | CC1/2 (inner) | CC1/2 (outer) | R work | R free | 〈B〉 (Å²) |
|---|---|---|---|---|---|---|---|---|
| CC(\| | 0.4 | 18 | 9.7 | 0.989 | 0.753 | 0.1411 | 0.1765 | 10.7 |
| CC(\| | 0.4 | 15 | 7.7 | 0.985 | 0.734 | 0.1436 | 0.1797 | 9.7 |
| CC(\| | 0.9 | 24 | 13.1 | 0.941 | 0.805 | 0.1443 | 0.1807 | 10.5 |
| CC( | 0.4 | 29 | 15.5 | 0.988 | 0.666 | 0.1434 | 0.1816 | 10.4 |
| CC( | 0.9 | 38 | 19.9 | 0.989 | 0.670 | 0.1447 | 0.1823 | 9.7 |
| CC( | 0.4 | 22 | 11.9 | 0.983 | 0.627 | 0.1437 | 0.1825 | 10.6 |
| CC( | 0.4 | 17 | 8.7 | 0.980 | 0.620 | 0.1459 | 0.1828 | 10.4 |
| CC( | 0.4 | 18 | 9.2 | 0.978 | 0.596 | 0.1475 | 0.1839 | 9.2 |
| CC( | 0.4 | 25 | 13.2 | 0.984 | 0.647 | 0.1449 | 0.1841 | 10.2 |
| CC( | 2.2 | 96 | 49.0 | 0.986 | 0.726 | 0.1474 | 0.1849 | 8.8 |
| | 1.2 | 110 | 57.3 | 0.984 | 0.722 | 0.1481 | 0.1871 | 8.0 |
| All data | 2.2 | 130 | 67.8 | 0.969 | 0.712 | 0.1551 | 0.1939 | 7.1 |