Ulrich Zander, Michele Cianci, Nicolas Foos, Catarina S Silva, Luca Mazzei, Chloe Zubieta, Alejandro de Maria, Max H Nanao.
Abstract
Recent advances in macromolecular crystallography have made it practical to rapidly collect hundreds of sub-data sets consisting of small oscillations of incomplete data. This approach, generally referred to as serial crystallography, has many uses, including an increased effective dose per data set, the collection of data from crystals without harvesting (in situ data collection) and studies of dynamic events such as catalytic reactions. However, selecting which data sets from this type of experiment should be merged can be challenging and new methods are required. Here, it is shown that a genetic algorithm can be used for this purpose, and five case studies are presented in which the merging statistics are significantly improved compared with conventional merging of all data.
Keywords: cluster analysis; genetic algorithms; serial crystallography
Year: 2016 PMID: 27599735 PMCID: PMC5013596 DOI: 10.1107/S2059798316012079
Source DB: PubMed Journal: Acta Crystallogr D Struct Biol ISSN: 2059-7983 Impact factor: 7.652
Crystal and data-collection parameters
| Macromolecule | Glucose isomerase | Ultralente insulin | Thermolysin | LUX–DNA | Urease |
|---|---|---|---|---|---|
| Space group |  |  |  |  |  |
| Unit-cell parameters (Å, °) |  |  |  |  |  |
| Beamline | ID23-EH2, ESRF | ID23-EH2, ESRF | ID29, ESRF | ID23-EH2, ESRF | P13, PETRA III |
| Wavelength (Å) | 0.8731 | 0.8731 | 1.280 | 0.8731 | 2.0664 |
| Beam size (H × V or diameter) (µm) | 9 × 5 | 9 × 5 | 10 × 10 | 9 × 5 | 30 |
| Crystal size range (µm) | 10 × 10 × 10–30 × 30 × 30 | 5 × 5 × 5–15 × 15 × 15 | 20 × 20 × 100 | 25 × 5 × 5–100 × 5 × 5 | 20 × 20 × 40–20 × 20 × 70 |
| Photon flux (photons s⁻¹) | 1.6 × 10¹¹ | 7.0 × 10¹⁰ | 4.1 × 10¹¹, 8.4 × 10¹¹ | 4.4 × 10¹⁰ | 3.4 × 10¹¹ |
| Exposure per image (s) | 0.1 | 0.25 | 0.037 | 0.1 | 0.04 |
| No. of images per sub-data set | 140 | 100 | 100 | 100 | 300 |
| Dose per sub-data set (average diffraction-weighted dose) (MGy) | 6.0–7.3 | 5.3–7.5 | 3.0–6.2 | 1.67 | 0.48 |
| Dose per sub-data set (average dose exposed region) (MGy) | 1.1–7.8 | 2.96–10.5 | 4.2–8.7 | 0.20–0.63 | 0.82 |
| Oscillation range (°) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Total angular range per sub-data set (°) | 14 | 10 | 10 | 10 | 30 |
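The last rows of the table are linked by simple arithmetic: the total angular range of a sub-data set is the number of images times the oscillation range per image. The short check below reproduces that row from the values above. It also derives a nominal exposure time per sub-data set, which is not a row in the table and ignores any detector readout or dead time, so it should be read as an illustrative assumption rather than a reported quantity.

```python
# Values transcribed from the table above; the oscillation range is 0.1° for all systems.
OSCILLATION_DEG = 0.1

SYSTEMS = {
    "Glucose isomerase": {"images": 140, "exposure_s": 0.1, "range_deg": 14},
    "Ultralente insulin": {"images": 100, "exposure_s": 0.25, "range_deg": 10},
    "Thermolysin": {"images": 100, "exposure_s": 0.037, "range_deg": 10},
    "LUX-DNA": {"images": 100, "exposure_s": 0.1, "range_deg": 10},
    "Urease": {"images": 300, "exposure_s": 0.04, "range_deg": 30},
}


def angular_range(images, oscillation_deg=OSCILLATION_DEG):
    """Total rotation covered by one sub-data set, in degrees."""
    return round(images * oscillation_deg, 6)


def nominal_exposure_time(images, exposure_s):
    """Summed exposure per sub-data set in seconds (readout time ignored)."""
    return round(images * exposure_s, 6)


for name, p in SYSTEMS.items():
    sweep = angular_range(p["images"])
    seconds = nominal_exposure_time(p["images"], p["exposure_s"])
    print(f"{name}: {sweep} deg per sub-data set, {seconds} s nominal exposure")
```
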
Figure 1. Schematic diagram of the genetic algorithm steps. In this example there are four individuals, with nine sub-data sets to be segregated into three groups. The individuals are first initialized randomly; the nine sub-data sets are assigned randomly to group 1, 2 or 3. Within an individual, three scaling runs in XSCALE are then performed, one for each group. The merging statistics are then converted to fitness scores, and the individual receives the fitness of its highest-scoring group (it is also possible to use the average fitness). In this case, individual 4 is removed from the population because of lower fitness (fitness values are not shown) and replaced with a new individual. The DEAP built-in mutation and crossover genetic modifiers are then applied, followed by cycling back to the scoring step. The background colour indicates the source of the chromosome. For example, after the crossover step between individuals 1 and 2, two ‘new’ individuals are created consisting of (i) the group assignments of sub-data sets 1–4 from individual 1 and the group assignments of sub-data sets 5–9 from individual 2 and (ii) the group assignments of sub-data sets 5–9 from individual 1 and the group assignments of sub-data sets 1–4 from individual 2. After crossover, mutations are randomly introduced as shown (yellow circles).
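The loop described in the caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the standard library rather than DEAP, the function `toy_group_score` is a hypothetical stand-in for an XSCALE scaling run, the per-sub-data-set `quality` values are invented, and keeping the best individual unchanged between generations is an added assumption. The caption does not say which DEAP operators were used; one-point crossover and uniform integer mutation are chosen here because they match the example in the figure.

```python
import random

N_SUBSETS = 9   # sub-data sets to segregate (as in the figure)
N_GROUPS = 3    # groups per individual
POP_SIZE = 4    # individuals in the population


def toy_group_score(members, quality):
    """Hypothetical stand-in for one XSCALE run: mean quality of a group."""
    if not members:
        return 0.0
    return sum(quality[i] for i in members) / len(members)


def fitness(individual, quality):
    """An individual receives the fitness of its highest-scoring group."""
    scores = []
    for g in range(N_GROUPS):
        members = [i for i, grp in enumerate(individual) if grp == g]
        scores.append(toy_group_score(members, quality))
    return max(scores)


def random_individual(rng):
    """Assign every sub-data set to a random group."""
    return [rng.randrange(N_GROUPS) for _ in range(N_SUBSETS)]


def crossover(a, b, rng):
    """One-point crossover: the two children swap chromosome tails."""
    point = rng.randint(1, N_SUBSETS - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(individual, rng, rate=0.1):
    """Reassign each sub-data set to a random group with probability `rate`."""
    return [rng.randrange(N_GROUPS) if rng.random() < rate else g
            for g in individual]


def evolve(quality, generations=50, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(POP_SIZE)]
    for _ in range(generations):
        # Score, then replace the least fit individual with a new random one.
        population.sort(key=lambda ind: fitness(ind, quality), reverse=True)
        best = population[0]
        population[-1] = random_individual(rng)
        # Cross over consecutive pairs, then mutate the offspring.
        offspring = []
        for a, b in zip(population[0::2], population[1::2]):
            offspring.extend(crossover(a, b, rng))
        # Assumption: the best individual is carried over unchanged.
        population = [best] + [mutate(ind, rng) for ind in offspring[:POP_SIZE - 1]]
    return max(population, key=lambda ind: fitness(ind, quality))


if __name__ == "__main__":
    # Invented per-sub-data-set quality values, for illustration only.
    quality = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.95, 0.4, 0.6]
    best = evolve(quality)
    print(best, fitness(best, quality))
```

With this toy score the search tends toward a group holding only the single best sub-data set. A realistic fitness would be built from merging statistics and would also have to reward completeness, which is presumably why weighted terms appear in the GA settings of the tables.
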
Figure 2. Improvement of data statistics. The horizontal axis represents progress of the algorithm. Upper panel: GA fitness is improved by the algorithm. Lower panel: the inner-shell 〈I/σ(I)〉 segregates into individuals with optimized values and suboptimal values.
Three columns are used for each system, with the first listing data resulting from merging all sub-data sets, the next from the best GA run and the last from the best HCA cluster. Note that for average sub-data-set parameters, not all sub-data sets contained enough reflections to calculate merging statistics [R_meas and 〈I/σ(I)〉].
|  | Glucose isomerase |  |  | Ultralente insulin |  |  | Thermolysin |  |  |
|---|---|---|---|---|---|---|---|---|---|
|  | All | GA | HCA | All | GA | HCA | All | GA | HCA |
| No. of sub-data sets | 30 | 21 | 20 | 53 | 30 | 19 | 206 | 8 | 36 |
| HCA CC cutoff | — | — | 0.98 | — | — | 0.97 | — | — | 0.93 |
| GA population size | — | 20 | — | — | 25 | — | — | 20 | — |
| GA generations | — | 300 | — | — | 300 | — | — | 60 | — |
| GA R_meas weight | — | 0.5 | — | — | 4 | — | — | 8 | — |
| GA 〈I/σ(I)〉 weight | — | 1.5 | — | — | 4 | — | — | 1 | — |
| GA CC1/2 weight | — | 2 | — | — | 5 | — | — | — | |
| GA groups | — | 3 | — | — | 3 | — | — | 8 | — |
| Data sets in common between GA and HCA | — | 19 | — | — | 15 | — | — | 0 | — |
| Sub data-set R_meas | 20.6 (22.9) | 9.0 (3.9) | 9.8 (3.9) | 9.7 (6.5) | 8.5 (5.7) | 9.4 (5.1) | 27.2 (27.8) | 5.6 (2.3) | 10.2 (3.8) |
| Sub data-set 〈I/σ(I)〉 | 7.6 (4.6) | 9.7 (3.7) | 9.8 (3.9) | 14.8 (8.5) | 15.4 (8.2) | 17.6 (8.9) | 7.2 (5.8) | 15.4 (4.8) | 9.2 (4.5) |
| Sub data-set completeness | 23.8 (1.4) | 24.4 (0.8) | 24.4 (0.8) | 13.9 (0.7) | 13.9 (0.8) | 14.0 (0.6) | 44.8 (4.4) | 46.1 (3.8) | 43.9 (3.3) |
| Resolution range (Å) | |||||||||
| Overall | 46.6–1.53 | 46.6–1.53 | 46.6–1.53 | 41.13–1.50 | 41.13–1.50 | 41.13–1.50 | 46.5–1.65 | 46.5–1.65 | 46.5–1.65 |
| Outer shell | 1.57–1.53 | 1.57–1.53 | 1.57–1.53 | 1.54–1.50 | 1.54–1.50 | 1.54–1.50 | 1.69–1.65 | 1.69–1.65 | 1.69–1.65 |
| Total No. of reflections | |||||||||
| Overall | 1111281 | 784525 | 751377 | 2019411 | 114748 | 72438 | 8387107 | 322089 | 152135 |
| Outer shell | 77220 | 54777 | 52212 | 14256 | 8106 | 5326 | 958429 | 24310 | 103756 |
| No. of unique reflections | |||||||||
| Overall | 71803 | 71520 | 72007 | 13640 | 13631 | 13547 | 40366 | 39986 | 40387 |
| Outer shell | 5281 | 5243 | 5316 | 1005 | 1004 | 1037 | 2758 | 2943 | 2756 |
| Completeness (%) | |||||||||
| Inner shell | 99.3 | 98.9 | 98.9 | 98.7 | 99.3 | 99.3 | 99.8 | 99.3 | 99.5 |
| Outer shell | 100.0 | 99.8 | 99.9 | 100.0 | 99.9 | 99.7 | 100.0 | 100.0 | 100.0 |
| Overall | 100.0 | 99.9 | 99.6 | 100.0 | 99.9 | 99.5 | 100.0 | 99.0 | 100.0 |
| Multiplicity | |||||||||
| Inner shell | 14.8 | 10.4 | 10.0 | 16.1 | 9.2 | 5.8 | 192.1 | 7.3 | 34.3 |
| Outer shell | 14.6 | 10.4 | 9.8 | 14.2 | 8.1 | 5.1 | 205.7 | 8.3 | 37.6 |
| Overall | 15.5 | 10.9 | 10.4 | 14.8 | 8.4 | 5.3 | 207.7 | 8.0 | 37.7 |
| R_merge (%) | |||||||||
| Inner shell | 14.7 | 9.4 | 8.0 | 43.4 | 6.8 | 7.0 | 39.0 | 8.7 | 11.8 |
| Outer shell | 249.1 | 153.8 | 170.0 | 108.1 | 85.4 | 74.5 | 379.0 | 171.2 | 119.7 |
| Overall | 33.9 | 20.4 | 21.3 | 34.0 | 9.1 | 9.2 | 91.3 | 27.4 | 25.2 |
| R_meas (%) | |||||||||
| Inner shell | 15.2 | 9.9 | 8.4 | 44.8 | 7.2 | 7.6 | 39.2 | 9.4 | 12.0 |
| Outer shell | 258.1 | 161.5 | 179.1 | 112.1 | 91.2 | 82.9 | 380.7 | 182.5 | 121.3 |
| Overall | 35.1 | 21.4 | 22.3 | 35.3 | 9.7 | 10.2 | 91.6 | 29.5 | 25.5 |
| 〈I/σ(I)〉 | |||||||||
| Inner shell | 27.9 | 27.1 | 26.8 | 31.9 | 33.4 | 23.4 | 66.2 | 99.4 | 27.3 |
| Outer shell | 2.5 | 2.5 | 2.4 | 2.6 | 2.4 | 1.9 | 1.4 | 1.6 | 4.1 |
| Overall | 10.7 | 10.6 | 10.3 | 13.3 | 12.9 | 9.7 | 15.2 | 17.0 | 12.9 |
| SigAno | |||||||||
| Inner shell | — | — | — | — | — | — | — | — | — |
| Outer shell | — | — | — | — | — | — | — | — | — |
| Overall | — | — | — | — | — | — | — | — | — |
| CC1/2 (%) | |||||||||
| Inner shell | 97.9 | 99.5 | 99.5 | 99.7 | 99.8 | 99.7 | 97.7 | 98.2 | 99.5 |
| Outer shell | 68.0 | 70.8 | 66.6 | 77.4 | 79.1 | 67.3 | 69.7 | 55.3 | 91.5 |
| Overall | 99.2 | 99.5 | 99.5 | 99.3 | 99.8 | 99.7 | 99.1 | 91.7 | 99.7 |
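The multiplicity rows can be cross-checked against the reflection counts, since overall multiplicity is approximately the total number of reflections divided by the number of unique reflections. The sketch below does this for the glucose isomerase columns only. The tolerance of 0.1 is an arbitrary choice made here, on the assumption that the reported multiplicity may have been computed over a slightly different set of reflections than the listed totals.

```python
# Overall values for glucose isomerase, transcribed from the table above:
# (total reflections, unique reflections, reported overall multiplicity)
GLUCOSE_ISOMERASE = {
    "All": (1111281, 71803, 15.5),
    "GA": (784525, 71520, 10.9),
    "HCA": (751377, 72007, 10.4),
}


def multiplicity(total, unique):
    """Average number of observations per unique reflection."""
    return total / unique


def is_consistent(total, unique, reported, tolerance=0.1):
    """True when the recomputed multiplicity is within `tolerance` of the reported one."""
    return abs(multiplicity(total, unique) - reported) <= tolerance


for column, (total, unique, reported) in GLUCOSE_ISOMERASE.items():
    ratio = multiplicity(total, unique)
    flag = "ok" if is_consistent(total, unique, reported) else "check"
    print(f"{column}: recomputed {ratio:.2f}, reported {reported} ({flag})")
```

Applying the same check to the other systems is a quick way to spot transcription problems. In the values as listed here, the overall total for the ultralente insulin "All" column and for the thermolysin "HCA" column each differ from the reported multiplicity by roughly a factor of ten, so those two entries are worth comparing against the published table.
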
|  | LUX–DNA |  |  | Urease |  |  |
|---|---|---|---|---|---|---|
|  | All | GA | HCA | All | GA | HCA |
| No. of sub-data sets | 204 | 36 | 77 | 127 | 39 | 79 |
| HCA CC cutoff | — | — | 0.75 | — | — | 0.8 |
| GA population size | — | 25 | — | — | 20 | — |
| GA generations | — | 300 | — | — | 250 | — |
| GA R_meas weight | — | 1 | — | — | 1 | — |
| GA 〈I/σ(I)〉 weight | — | 2.5 | — | — | 3 | — |
| GA CC1/2 weight | — | 1 | — | — | 2 | — |
| GA groups | — | 3 | — | — | 3 | — |
| Data sets in common between GA and HCA | — | 26 | — | — | 34 | — |
| Sub data-set R_meas |  |  |  | 15.3 (18.9) | 5.2 (3.0) | 6.5 (3.4) |
| Sub data-set 〈I/σ(I)〉 |  |  |  | 10.9 (6.7) | 16.6 (6.2) | 14.1 (5.8) |
| Sub data-set completeness | 5.4 (1.2) | 5.7 (0.9) | 5.8 (0.8) | 42.9 (4.9) | 43.9 (2.3) | 43.5 (3.1) |
| Resolution range (Å) | ||||||
| Overall | 68.86–2.80 | 68.86–2.80 | 68.86–2.80 | 98.38–2.09 | 98.38–2.09 | 98.38–2.09 |
| Outer shell | 2.87–2.80 | 2.87–2.80 | 2.87–2.80 | 2.14–2.09 | 2.14–2.09 | 2.14–2.09 |
| Total No. of reflections | ||||||
| Overall | 292976 | 70083 | 118205 | 20341640 | 5794906 | 12815153 |
| Outer shell | 16922 | 4231 | 7949 | 749997 | 182806 | 608053 |
| No. of unique reflections | ||||||
| Overall | 14499 | 14224 | 14024 | 107644 | 107603 | 105577 |
| Outer shell | 1039 | 928 | 1000 | 7320 | 7016 | 7173 |
| Completeness (%) | ||||||
| Inner shell | 100.0 | 94.5 | 96.2 | 99.9 | 99.9 | 100.0 |
| Outer shell | 100.1 | 89.3 | 99.5 | 100.0 | 87.5 | 100.0 |
| Overall | 99.9 | 98.0 | 99.4 | 100.0 | 99.1 | 100.0 |
| Multiplicity | ||||||
| Inner shell | 22.6 | 5.3 | 9.0 | 224.3 | 66.9 | 146.0 |
| Outer shell | 16.3 | 4.1 | 7.9 | 102.4 | 22.8 | 84.8 |
| Overall | 20.2 | 4.8 | 8.4 | 189.0 | 53.4 | 121.4 |
| R_merge (%) | ||||||
| Inner shell | 71.0 | 17.0 | 32.6 | 93.5 | 6.7 | 8.4 |
| Outer shell | 187.0 | 83.2 | 141.3 | 157.8 | 128.8 | 145.0 |
| Overall | 74.5 | 37.0 | 50.3 | 86.0 | 29.8 | 32.0 |
| R_meas (%) | ||||||
| Inner shell | 72.7 | 18.8 | 35.1 | 93.6 | 6.7 | 8.4 |
| Outer shell | 192.6 | 93.5 | 151.2 | 158.6 | 131.3 | 145.8 |
| Overall | 76.1 | 41.0 | 52.9 | 86.2 | 30.1 | 32.2 |
| 〈I/σ(I)〉 | ||||||
| Inner shell | 11.8 | 16.3 | 11.3 | 101.1 | 121.4 | 102.9 |
| Outer shell | 3.7 | 5.0 | 4.8 | 3.4 | 2.4 | 4.2 |
| Overall | 7.8 | 9.9 | 7.5 | 24.9 | 23.3 | 25.8 |
| SigAno | ||||||
| Inner shell | — | — | — | 3.16 | 3.79 | 3.27 |
| Outer shell | — | — | — | 0.73 | 0.76 | 0.73 |
| Overall | — | — | — | 0.99 | 1.05 | 1.00 |
| CC1/2 (%) | ||||||
| Inner shell | 93.7 | 95.0 | 98.1 | 99.9 | 99.6 | 99.9 |
| Outer shell | 68.1 | 59.6 | 44.9 | 80.5 | 66.6 | 92.8 |
| Overall | 94.4 | 94.3 | 97.0 | 99.9 | 99.8 | 99.9 |
Average values; standard deviations are given in parentheses.
Insufficient reflections in the sub-data sets to obtain statistics.