Sathya R Chitturi1,2, Daniel Ratner2, Richard C Walroth2, Vivek Thampy2, Evan J Reed1, Mike Dunne1,2, Christopher J Tassone2, Kevin H Stone2.
Abstract
A key step in the analysis of powder X-ray diffraction (PXRD) data is the accurate determination of unit-cell lattice parameters. This step often requires significant human intervention and is a bottleneck that hinders efforts towards automated analysis. This work develops a series of one-dimensional convolutional neural networks (1D-CNNs) trained to provide lattice parameter estimates for each crystal system. A mean absolute percentage error of approximately 10% is achieved for each crystal system, which corresponds to a 100- to 1000-fold reduction in lattice parameter search space volume. The models learn from nearly one million crystal structures contained within the Inorganic Crystal Structure Database and the Cambridge Structural Database and, due to the nature of these two complementary databases, the models generalize well across chemistries. A key component of this work is a systematic analysis of the effect of different realistic experimental non-idealities on model performance. It is found that the addition of impurity phases, baseline noise and peak broadening presents the greatest challenges to learning, while zero-offset error and random intensity modulations have little effect. However, appropriate data modification schemes can be used to bolster model performance and yield reasonable predictions, even for data which simulate realistic experimental non-idealities. In order to obtain accurate results, a new approach is introduced which uses the initial machine learning estimates with existing iterative whole-pattern refinement schemes to tackle automated unit-cell solution. © Sathya R. Chitturi et al. 2021.
Keywords: analysis automation; indexing; machine learning; powder diffraction
Year: 2021 PMID: 34963768 PMCID: PMC8662964 DOI: 10.1107/S1600576721010840
Source DB: PubMed Journal: J Appl Crystallogr ISSN: 0021-8898 Impact factor: 3.304
Figure 1. Visualizations of data distribution by crystal system in (a) the ICSD, (b) the CSD, and (c) both the ICSD and the CSD. These two databases exhibit complementary distributions, which justifies the choice to combine them.
The 1D-CNN architecture used for PXRD data sets
The same architecture was used for all experiments, although the weights were allowed to vary according to the data.
| Layer (type) | Output shape | No. of parameters |
|---|---|---|
| InputLayer | (9000, 1) | 0 |
| MaxPooling1D (pool size = 3) | (3000, 1) | 0 |
| Conv1D (kernel = 5, filter = 3, activation = ReLU) | (3000, 5) | 20 |
| Conv1D (kernel = 5, filter = 3, activation = ReLU) | (3000, 5) | 80 |
| MaxPooling1D (pool size = 2) | (1500, 5) | 0 |
| Conv1D (kernel = 10, filter = 3, activation = ReLU) | (1500, 10) | 160 |
| Conv1D (kernel = 10, filter = 3, activation = ReLU) | (1500, 10) | 310 |
| MaxPooling1D (pool size = 2) | (750, 10) | 0 |
| Conv1D (kernel = 15, filter = 5, activation = ReLU) | (750, 15) | 765 |
| Conv1D (kernel = 15, filter = 5, activation = ReLU) | (750, 15) | 1140 |
| MaxPooling1D (pool size = 3) | (250, 15) | 0 |
| Conv1D (kernel = 20, filter = 5, activation = ReLU) | (250, 20) | 1520 |
| Conv1D (kernel = 20, filter = 5, activation = ReLU) | (250, 20) | 2020 |
| MaxPooling1D (pool size = 2) | (125, 20) | 0 |
| Conv1D (kernel = 30, filter = 5, activation = ReLU) | (125, 30) | 3030 |
| Conv1D (kernel = 30, filter = 5, activation = ReLU) | (125, 30) | 4530 |
| MaxPooling1D (pool size = 5) | (25, 30) | 0 |
| Flatten | (750) | 0 |
| Dense (activation = ReLU) | (80) | 60080 |
| Dense (activation = ReLU) | (50) | 4050 |
| Dense (activation = ReLU) | (10) | 510 |
| Dense (activation = ReLU) | (3) | 33 |
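The output shapes and parameter counts in the table can be cross-checked with a few lines of arithmetic. The sketch below is not the authors' code. It assumes that each convolution preserves the sequence length (as with "same" padding), that pooling divides the length by the pool size, and that the table's "kernel = N" gives the number of output channels while "filter = M" gives the width of the convolution window. That reading is an interpretation, chosen because it reproduces every parameter count listed. The names `LAYERS` and `trace` are hypothetical.

```python
# Layer list transcribed from the table. For "conv" entries the tuple is
# (kind, channels, width): channels is the table's "kernel = N" and width is
# the table's "filter = M" (an interpretation, not confirmed by the source).
LAYERS = [
    ("pool", 3),
    ("conv", 5, 3), ("conv", 5, 3), ("pool", 2),
    ("conv", 10, 3), ("conv", 10, 3), ("pool", 2),
    ("conv", 15, 5), ("conv", 15, 5), ("pool", 3),
    ("conv", 20, 5), ("conv", 20, 5), ("pool", 2),
    ("conv", 30, 5), ("conv", 30, 5), ("pool", 5),
    ("flatten",),
    ("dense", 80), ("dense", 50), ("dense", 10), ("dense", 3),
]


def trace(layers, length=9000, channels=1):
    """Return (output_shape, n_params) for each layer after the input layer."""
    rows = []
    features = None
    for layer in layers:
        kind = layer[0]
        if kind == "pool":
            length //= layer[1]                      # pooling shortens the sequence
            rows.append(((length, channels), 0))
        elif kind == "conv":
            out_channels, width = layer[1], layer[2]
            # weights (width * in_channels * out_channels) plus one bias per channel
            params = width * channels * out_channels + out_channels
            channels = out_channels                  # length unchanged (assumed padding)
            rows.append(((length, channels), params))
        elif kind == "flatten":
            features = length * channels
            rows.append(((features,), 0))
        elif kind == "dense":
            units = layer[1]
            params = features * units + units        # weights plus biases
            features = units
            rows.append(((units,), params))
    return rows


ROWS = trace(LAYERS)
PARAM_COUNTS = [params for _, params in ROWS]
TOTAL_PARAMS = sum(PARAM_COUNTS)

for shape, params in ROWS:
    print(shape, params)
print("total trainable parameters:", TOTAL_PARAMS)
```

Under these assumptions the per-layer counts match the table, and the first dense layer accounts for most of the parameters.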
MAPE for 1D-CNNs for each crystal system
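Null prediction refers to a MAPE prediction based on the mean lattice parameters of the data set. ML full corresponds to the prediction from the 1D-CNN models for 0–90° in 2θ. ML reduced corresponds to the prediction from the 1D-CNN models for 0–30° in 2θ. Ratio 1 and Ratio 2 are the null MAPE divided by the ML full and ML reduced MAPE, respectively. All MAPE values are percentages.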
| Symmetry | Null prediction | ML prediction full | Ratio 1 | ML prediction reduced | Ratio 2 | Training data set size |
|---|---|---|---|---|---|---|
| Cubic | 51.49 | 7.55 | 6.8 | 4.08 | 12.6 | 30 705 |
| Hexagonal | 47.37 | 7.35 | 6.4 | 8.94 | 5.3 | 17 842 |
| Trigonal | 46.58 | 15.72 | 3.0 | 15.30 | 3.0 | 25 784 |
| Tetragonal | 48.77 | 11.56 | 4.2 | 9.09 | 5.4 | 37 183 |
| Orthorhombic | 29.94 | 10.06 | 3.0 | 12.92 | 2.3 | 161 087 |
| Monoclinic | 24.76 | 11.79 | 2.1 | 11.23 | 2.2 | 445 708 |
| Triclinic | 20.06 | 3.11 | 6.5 | 2.68 | 7.5 | 243 651 |
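The quantities in this table can be illustrated with a small sketch. It assumes the standard definition of MAPE, averaged over all lattice lengths of all test patterns, and a null model that predicts the mean of each lattice length over a reference set. The text does not state whether the authors average in exactly this way, so treat the averaging and the function names (`mape`, `null_prediction`) as assumptions. The numbers are toy values, not data from the paper.

```python
def mape(true_rows, pred_rows):
    """Mean absolute percentage error over all lattice lengths, in percent."""
    errors = [
        abs(p - t) / abs(t)
        for t_row, p_row in zip(true_rows, pred_rows)
        for t, p in zip(t_row, p_row)
    ]
    return 100.0 * sum(errors) / len(errors)


def null_prediction(reference_rows, n_test):
    """Predict the reference-set mean of each lattice length for every test pattern."""
    n = len(reference_rows)
    means = [sum(column) / n for column in zip(*reference_rows)]
    return [means] * n_test


# Toy lattice lengths (a, b, c) in angstroms; illustrative only.
reference = [(4.0, 4.0, 4.0), (6.0, 6.0, 6.0)]
test_true = [(5.0, 5.0, 5.0), (10.0, 10.0, 10.0)]
model_pred = [(5.5, 5.5, 5.5), (9.0, 9.0, 9.0)]

null_mape = mape(test_true, null_prediction(reference, len(test_true)))
ml_mape = mape(test_true, model_pred)
ratio = null_mape / ml_mape  # same construction as the Ratio columns

print(null_mape, ml_mape, ratio)
```

The ratio columns of the table follow the same construction: for the cubic system, 51.49 / 7.55 ≈ 6.8.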
Figure 2. The effect of including experimental modifications in testing data on models trained without including any experimental modifications. Results are shown for unit-cell length predictions for (a) the cubic and (b) the triclinic crystal systems. The modifications studied correspond to zero error, intensity modulation, Gaussian broadening, Gaussian baseline noise and multiple impurity phases. Perfect refers to unmodified data and null refers to prediction using the mean lattice parameters of the data set. Zero error and intensity modulation have little effect on ML prediction. On the other hand, baseline noise and multiple phases are particularly damaging modifications.
MAPE for 1D-CNNs trained on unmodified data and tested on data containing modifications
Baseline noise, broadening and impurities damage ML performance, while intensity modulations and peak shifting have little effect. Null refers to predictions based on the mean lattice parameters of the data set. Perfect refers to training and testing on unmodified data and is intended as a control.
| Symmetry | Null | Perfect | Broaden | Baseline | Intensity | Shift | Impurities |
|---|---|---|---|---|---|---|---|
| Cubic | 51.49 | 7.55 | 8.62 | 71.26 | 7.84 | 7.53 | 59.22 |
| Hexagonal | 47.37 | 7.35 | 10.92 | 57.87 | 7.53 | 7.47 | 48.36 |
| Trigonal | 46.58 | 15.72 | 19.03 | 41.14 | 15.97 | 15.97 | 39.25 |
| Tetragonal | 48.77 | 11.56 | 17.15 | 43.02 | 11.74 | 11.62 | 42.02 |
| Orthorhombic | 29.94 | 10.06 | 18.45 | 34.85 | 10.45 | 10.02 | 27.34 |
| Monoclinic | 24.76 | 11.79 | 20.01 | 38.21 | 12.16 | 11.74 | 22.94 |
| Triclinic | 20.06 | 3.11 | 19.87 | 24.11 | 3.46 | 3.15 | 16.02 |
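To make the modification types concrete, the sketch below applies four of them to a toy stick pattern. The text here does not describe how the authors generated their modified patterns, so every detail is an assumption made for illustration: the uniform noise distribution, the per-bin intensity scaling, the Gaussian kernel truncated at four standard deviations, and all function names. Impurity phases are omitted because they require a second simulated pattern.

```python
import math
import random


def zero_shift(pattern, n_bins):
    """Zero-offset error: shift the pattern right by n_bins >= 0, padding with zeros."""
    return [0.0] * n_bins + pattern[: len(pattern) - n_bins]


def gaussian_broaden(pattern, sigma_bins):
    """Convolve with a normalised Gaussian kernel truncated at 4 sigma."""
    half = max(1, int(4 * sigma_bins))
    kernel = [math.exp(-0.5 * (k / sigma_bins) ** 2) for k in range(-half, half + 1)]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]
    n = len(pattern)
    out = [0.0] * n
    for i, value in enumerate(pattern):
        if value == 0.0:
            continue  # empty bins contribute nothing
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                out[j] += value * w
    return out


def add_baseline_noise(pattern, level, rng):
    """Add noise that does not exceed `level` times the largest peak."""
    peak = max(pattern)
    return [v + rng.uniform(0.0, level * peak) for v in pattern]


def modulate_intensity(pattern, spread, rng):
    """Rescale each bin by a random factor in [1 - spread, 1 + spread]."""
    return [v * rng.uniform(1.0 - spread, 1.0 + spread) for v in pattern]


# Toy pattern: two reflections on a 100-bin 2-theta grid.
rng = random.Random(0)
pattern = [0.0] * 100
pattern[20] = 1.0
pattern[60] = 0.5

shifted = zero_shift(pattern, 2)
broadened = gaussian_broaden(pattern, 2.0)
noisy = add_baseline_noise(pattern, 0.05, rng)
modulated = modulate_intensity(pattern, 0.2, rng)
```

In this toy setting, shifting and intensity modulation leave the peak positions essentially intact, whereas broadening spreads each reflection over neighbouring bins and baseline noise fills the empty regions between peaks.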
Figure 3. The impact of (a), (b) broadening, (c), (d) baseline noise and (e), (f) multiple impurity phases on ML predictions for unit-cell lengths for (left-hand column) cubic and (right-hand column) triclinic crystal systems. NM refers to a pattern with no modifications and M refers to a pattern with the corresponding experimental modification. The notation A/B indicates training with modification A (NM or M) and testing with modification B (NM or M). The null column is a prediction based on the mean lattice parameters of the data set. The performance of ML models is greatly improved when training and testing with modifications (M/M) relative to training on unmodified data and testing on modified data (NM/M).
MAPE for 1D-CNNs trained and tested on modified data
Incorporating modifications into the training set reduces the MAPE for baseline noise, broadening and multiphase impurities. Null refers to predictions based on the mean lattice parameters of the data set. Perfect refers to training and testing on unmodified data and is intended as a control.
| Symmetry | Null | Perfect | Broaden | Baseline | Impurities |
|---|---|---|---|---|---|
| Cubic | 51.49 | 7.55 | 3.4 | 5.2 | 9.4 |
| Hexagonal | 47.37 | 7.35 | 8.0 | 10.2 | 13.6 |
| Trigonal | 46.58 | 15.72 | 15.8 | 16.9 | 22.9 |
| Tetragonal | 48.77 | 11.56 | 12.6 | 15.6 | 18.2 |
| Orthorhombic | 29.94 | 10.06 | 10.3 | 11.5 | 17.46 |
| Monoclinic | 24.76 | 11.79 | 12.4 | 12.3 | 16.02 |
| Triclinic | 20.06 | 3.11 | 3.7 | 4.6 | 10.48 |
MAPE for 1D-CNNs trained on modified data and tested on unmodified data
Prediction is generally worse relative to the perfect condition. However, some classical ML augmentation improvements are apparent for the broadening condition. Null refers to predictions based on the mean lattice parameters of the data set. Perfect refers to training and testing on unmodified data and is intended as a control.
| Symmetry | Null | Perfect | Broaden | Baseline | Impurities |
|---|---|---|---|---|---|
| Cubic | 51.49 | 7.55 | 3.1 | 4.4 | 9.9 |
| Hexagonal | 47.37 | 7.35 | 8.5 | 10.0 | 13.5 |
| Trigonal | 46.58 | 15.72 | 16.7 | 15.5 | 20.0 |
| Tetragonal | 48.77 | 11.56 | 13.0 | 16.5 | 17.5 |
| Orthorhombic | 29.94 | 10.06 | 9.5 | 10.3 | 15.71 |
| Monoclinic | 24.76 | 11.79 | 12.6 | 11.6 | 13.5 |
| Triclinic | 20.06 | 3.11 | 3.1 | 3.8 | 7.5 |
MAPE for 1D-CNNs as a function of the number of visible peaks for models trained on both 0–30° and 0–90° 2θ for the tetragonal crystal system
The noise level is expressed as a fraction of the intensity of the largest reflection. For example, 0.05 corresponds to baseline noise which does not exceed 5% of the largest peak. ML full corresponds to the prediction from the 1D-CNN models for 0–90° in 2θ. ML reduced corresponds to the prediction from the 1D-CNN models for 0–30° in 2θ.
| Noise level | Percentage of visible peaks | Null | ML prediction full | ML prediction reduced |
|---|---|---|---|---|
| 0.0 | 100.0 | 46.58 | 11.56 | 9.09 |
| 0.001 | 85.3 | 46.58 | 13.56 | 12.29 |
| 0.005 | 73.4 | 46.58 | 14.76 | 13.38 |
| 0.01 | 67.0 | 46.58 | 14.12 | 14.19 |
| 0.05 | 44.90 | 46.58 | 17.21 | 18.35 |
| 0.1 | 33.30 | 46.58 | 19.12 | 19.15 |
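One simple way to read the "percentage of visible peaks" column is to count a reflection as visible when its intensity exceeds the noise ceiling. The sketch below follows that reading. The visibility criterion and the function name `visible_peak_percentage` are assumptions rather than the authors' stated procedure, and the intensities are toy values.

```python
def visible_peak_percentage(intensities, noise_level):
    """Percentage of peaks whose intensity exceeds noise_level times the largest peak."""
    threshold = noise_level * max(intensities)
    visible = [value for value in intensities if value > threshold]
    return 100.0 * len(visible) / len(intensities)


# Toy peak intensities in arbitrary units; illustrative only.
peaks = [100.0, 40.0, 8.0, 3.0, 0.4, 0.05]

for level in (0.0, 0.01, 0.05, 0.1):
    print(level, round(visible_peak_percentage(peaks, level), 1))
```

As in the table, the visible fraction falls as the noise level rises, because weak reflections drop below the noise ceiling first.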
The average number of impurity peaks for each intensity range of a given PXRD pattern
The intensity level is expressed as a fraction of the intensity of the largest reflection. For example, 0.05 implies that the largest impurity peak does not exceed 5% of the largest peak in the original PXRD pattern.
| Intensity level | Average number of peaks |
|---|---|
| 0–0.001 | 363.3 |
| 0.001–0.005 | 104.0 |
| 0.005–0.01 | 19.1 |
| 0.01–0.05 | 18.9 |
| 0.05–0.1 | 3.7 |
Percentage within bound (PWB) for 1D-CNN models trained on the ICSD and CSD compared with null predictions; the ML models significantly outperform the null predictions
PWB reports the percentage of testing examples which have all three length parameters within a given MAPE bound. PWB10, for example, indicates the percentage of testing examples for which all three predicted lattice parameters are within 10% of their true lattice parameters.
| PWB | Cubic ML | Cubic Null | Hexagonal ML | Hexagonal Null | Trigonal ML | Trigonal Null | Tetragonal ML | Tetragonal Null | Orthorhombic ML | Orthorhombic Null | Monoclinic ML | Monoclinic Null | Triclinic ML | Triclinic Null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PWB1 | 19.1 | 1.3 | 1.9 | 0.0 | 1.4 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.1 | 0.1 |
| PWB5 | 78.3 | 2.9 | 34.1 | 0.0 | 15.0 | 0.0 | 15.5 | 0.0 | 9.6 | 0.0 | 5.9 | 0.0 | 50.1 | 0.6 |
| PWB10 | 91.3 | 15.4 | 60.7 | 0.0 | 32.3 | 0.0 | 35.2 | 0.0 | 34.5 | 1.1 | 23.6 | 1.8 | 83.4 | 4.4 |
| PWB20 | 96.6 | 23.8 | 79.6 | 0.6 | 53.9 | 2.0 | 56.3 | 2.4 | 64.5 | 10.8 | 51.0 | 14.9 | 93.9 | 28.5 |
| PWB30 | 98.1 | 32.4 | 86.1 | 15.5 | 67.0 | 8.3 | 70.9 | 8.8 | 80.3 | 29.7 | 69.9 | 36.7 | 97.7 | 54.2 |
| PWB40 | 98.6 | 44.0 | 90.7 | 15.5 | 75.3 | 18.1 | 81.9 | 21.0 | 87.8 | 50.1 | 83.7 | 58.0 | 98.8 | 72.7 |
| PWB50 | 99.6 | 50.8 | 94.3 | 29.4 | 82.4 | 32.4 | 88.1 | 35.8 | 93.6 | 64.8 | 90.6 | 73.3 | 99.7 | 83.0 |
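The PWB metric defined in the caption is straightforward to compute. The following sketch implements that definition directly: an example counts only if all three predicted lengths fall within the bound. The function name `pwb` and the toy values are illustrative, and the choice of an inclusive bound (`<=`) is an assumption.

```python
def pwb(true_rows, pred_rows, bound_percent):
    """Percentage of examples whose three lengths are all within bound_percent of truth."""
    hits = 0
    for t_row, p_row in zip(true_rows, pred_rows):
        errors = [abs(p - t) / abs(t) * 100.0 for t, p in zip(t_row, p_row)]
        if all(error <= bound_percent for error in errors):
            hits += 1
    return 100.0 * hits / len(true_rows)


# Toy lattice lengths (a, b, c) in angstroms; illustrative only.
true_rows = [(5.0, 6.0, 7.0), (10.0, 10.0, 10.0), (4.0, 8.0, 12.0), (3.0, 3.0, 9.0)]
pred_rows = [(5.2, 6.1, 7.3), (10.5, 9.2, 10.1), (4.0, 8.0, 15.0), (3.0, 3.1, 9.0)]

for bound in (1, 5, 10, 30):
    print(f"PWB{bound}:", pwb(true_rows, pred_rows, bound))
```

Because a single poor length disqualifies the whole example, PWB is stricter than the per-parameter MAPE, which helps explain why lower-symmetry systems with three independent lengths score lower at tight bounds.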
Volume of search space ratio for each crystal system
| Symmetry | PWB10 | PWB5 | VR10 | VR5 |
|---|---|---|---|---|
| Cubic | 91.3 | 78.3 | 369.1 | 2926.8 |
| Hexagonal | 60.7 | 34.1 | 139.4 | 557.7 |
| Trigonal | 32.3 | 15.0 | 242.5 | 969.4 |
| Tetragonal | 35.2 | 15.5 | 181.9 | 727.5 |
| Orthorhombic | 34.5 | 9.6 | 3908.7 | 31 369.3 |
| Monoclinic | 23.6 | 5.9 | 2139.1 | 17 017.4 |
| Triclinic | 83.4 | 50.1 | 1902.1 | 15 216.7 |
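The formula behind the volume ratio (VR) is not given here. A plausible reading is the volume of a default search box divided by the volume of the box spanned by the ML prediction plus or minus a percentage bound. The sketch below follows that reading only. The default range of 2 to 30 Å, the predicted lengths and all function names are hypothetical placeholders.

```python
def search_volume(bounds):
    """Volume of a hyper-rectangular search region given (low, high) per parameter."""
    volume = 1.0
    for low, high in bounds:
        volume *= high - low
    return volume


def ml_bounds(predicted, fraction):
    """Bounds of +/- fraction around each predicted lattice length."""
    return [(p * (1.0 - fraction), p * (1.0 + fraction)) for p in predicted]


def volume_ratio(default_bounds, predicted, fraction):
    """Default search volume divided by the ML-restricted search volume."""
    return search_volume(default_bounds) / search_volume(ml_bounds(predicted, fraction))


# Hypothetical default range and predictions for three independent lengths.
default_bounds = [(2.0, 30.0)] * 3
predicted = (7.6, 7.8, 10.7)

vr10 = volume_ratio(default_bounds, predicted, 0.10)
vr5 = volume_ratio(default_bounds, predicted, 0.05)
print(vr10, vr5, vr5 / vr10)
```

With this construction, halving the bound multiplies VR by 2 raised to the number of independent lengths. The table shows the same scaling: VR5/VR10 is about 8 for the orthorhombic system (three lengths) and about 4 for the hexagonal system (two lengths).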
Description of case studies for automatic experiments using Lp-Search
| Example | Structure | Crystal system | Lattice parameters (Å, °) |
|---|---|---|---|
| 1 | F0.5Ga1.8251Mg0.9975O3.5 | Cubic | 8.292, 8.292, 8.292, 90, 90, 90 |
| 2 | C116H1O4·4CH4O·4H2O | Triclinic | 11.2927, 13.455, 37.9436, 83.672, 89.873, 80.841 |
| 3 | C48H62ErN7O2Si2 | Hexagonal | 13.1144, 13.1144, 57.64, 90, 90, 120 |
Time taken and percentage converged for Example 1 using the ML+Lp-Search method
Automatic performance is similar to that obtained using default Lp-Search. Converged refers to the fraction of time that Lp-Search converges to the true lattice parameters in 20 minimizations of 50 000 iterations. VR is the search space volume ratio for each lattice parameter bound.
| Lattice parameter range | Converged | 〈Time〉 (s) | σ(Time) (s) | VR |
|---|---|---|---|---|
| 10% bound | 1.0 | 0.12 | 0.02 | 7.85 |
| 20% bound | 1.0 | 0.204 | 0.025 | 3.93 |
| 50% bound | 1.0 | 0.21 | 0.0 | 1.57 |
| (3–2 | 1.0 | 0.28 | 0.07 | 1 |
Time taken and percentage converged for Example 2 using the ML+Lp-Search method
ML+Lp-Search is considerably faster than default Lp-Search performance. Converged refers to the fraction of time that Lp-Search converges to the true lattice parameters in 20 minimizations of 50 000 iterations. VR is the search space volume ratio for each lattice parameter bound.
| Lattice parameter range | Converged | 〈Time〉 (s) | σ(Time) (s) | VR |
|---|---|---|---|---|
| 10% bound | 1.0 | 37.00 | 27.79 | 7239 |
| 20% bound | 1.0 | 97.97 | 91.53 | 920 |
| 50% bound | 1.0 | 847.22 | 860.60 | 59 |
| (3–2 | 1.0 | 3989.61 | 3297.83 | 1 |
Time taken and percentage converged for Example 3 using the ML+Lp-Search method
ML+Lp-Search is considerably faster and converges more often than default Lp-Search. Converged refers to the fraction of time that Lp-Search converges to the true lattice parameters in 20 minimizations of 50 000 iterations. VR is the search space volume ratio for each lattice parameter bound.
| Lattice parameter range | Converged | 〈Time〉 (s) | σ(Time) (s) | VR |
|---|---|---|---|---|
| 10% bound | 1.0 | 44.8 | 38.5 | 404 |
| 20% bound | 0.85 | 185.6 | 142.5 | 101 |
| 50% bound | 0.50 | 354.2 | 155.1 | 16 |
| (3–2 | 0.0 | | | 1 |
Figure 4. Percentage of times that Lp-Search converged to the correct answer as a function of percentage bound for 100 test samples for each crystal system.
ML+Lp-Search method applied to predict lattice parameters on synchrotron data automatically
An asterisk (*) next to a prediction indicates that the predicted lattice parameter is an integral multiple or divisor of the true lattice parameter.
| Material | Crystal system | Real a (Å) | Real b (Å) | Real c (Å) | ML a (Å) | ML b (Å) | ML c (Å) | ML+Lp-Search a (Å) | ML+Lp-Search b (Å) | ML+Lp-Search c (Å) |
|---|---|---|---|---|---|---|---|---|---|---|
| LaB6 | Cubic | 4.1568 | 4.1568 | 4.1568 | 4.0950 | 4.0950 | 4.0950 | 4.1584 | 4.1584 | 4.1584 |
| SiO2 | Trigonal | 4.9142 | 4.9142 | 5.4057 | 4.6304 | 4.6304 | 7.4401 | 4.9030 | 4.9030 | 5.4448 |
| (C4H5KO6) | Orthorhombic | 7.6130 | 7.7872 | 10.6546 | 6.8423 | 9.8300 | 11.5349 | 7.6128 | 7.7871 | 10.6544 |
| ZnO | Hexagonal | 3.2483 | 3.2483 | 5.2041 | 3.1494 | 3.1494 | 5.5284 | 3.2485 | 3.2485 | 5.2044 |
| In2O3 | Cubic | 10.1146 | 10.1146 | 10.1146 | 10.1178 | 10.1178 | 10.1178 | 10.1152 | 10.1152 | 10.1152 |
| Fe2O3 | Hexagonal | 5.0329 | 5.0329 | 13.7420 | 4.1460 | 4.1460 | 9.5030 | 5.0329 | 5.0329 | 6.8710* |
| CaCO3 | Hexagonal | 4.9865 | 4.9865 | 17.0609 | 4.8977 | 4.8977 | 8.9315 | 4.9876 | 4.9876 | 8.5330 |
| NaHCO3 | Monoclinic | 3.4800 | 9.6844 | 8.0555 | 5.4356 | 6.9519 | 9.4572 | 6.7952 | 12.2167 | 7.9012 |
| NaCl | Cubic | 5.6411 | 5.6411 | 5.6411 | 5.0638 | 5.0638 | 5.0638 | 5.6448 | 5.6448 | 5.6448 |
| KCl | Cubic | 6.2933 | 6.2933 | 6.2933 | 6.6645 | 6.6645 | 6.6645 | 6.2978 | 6.2978 | 6.2978 |
| Na2S2O3·5H2O | Monoclinic | 5.9501 | 7.5349 | 21.6000 | 6.4716 | 10.2068 | 15.2701 | 5.9553 | 15.2145 | 22.0355 |
| MgCl2·6H2O | Monoclinic | 6.0748 | 7.1084 | 9.8619 | 4.8602 | 6.5877 | 7.5698 | 6.0751 | 7.1089 | 9.8626 |
| (CH6N)2PbI3Cl | Orthorhombic | 4.6447 | 15.4300 | 19.2880 | 7.3973 | 14.7956 | 23.1625 | 9.2914* | 15.4322 | 19.2914 |
| KHCO3 | Monoclinic | 3.7131 | 5.6299 | 15.1794 | 4.7047 | 9.6333 | 16.1360 | 6.2625 | 9.8191 | 24.1920 |
| C | Cubic | 3.5656 | 3.5656 | 3.5656 | 2.7742 | 2.7742 | 2.7742 | 3.5655 | 3.5655 | 3.5655 |
Notes on materials:
SiO2: α-quartz.
(C4H5KO6): naturally occurring potassium bitartrate.
Fe2O3: contains 7.5 wt% Fe3O4 impurity.
(CH6N)2PbI3Cl: contains 5.4 wt% of methylammonium chloride and 0.7 wt% of (CH6N)PbI3 impurities (Kim et al., 2020).
C: diamond.
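The asterisk convention in the table above (a predicted length that is an integral multiple or divisor of the true length) can be checked programmatically. The sketch below is one way to do so. The 1% relative tolerance, the maximum factor of 4 and the function name `integer_relation` are assumptions chosen for illustration, not values taken from the paper.

```python
def integer_relation(predicted, true, max_factor=4, rel_tol=0.01):
    """Return k >= 2 if predicted is close to k * true or true / k, otherwise None."""
    for k in range(2, max_factor + 1):
        for target in (true * k, true / k):
            if abs(predicted - target) <= rel_tol * target:
                return k
    return None


# Values taken from the table above (lengths in angstroms).
print(integer_relation(6.8710, 13.7420))  # Fe2O3 c: half of the true length
print(integer_relation(9.2914, 4.6447))   # (CH6N)2PbI3Cl a: about twice the true length
print(integer_relation(4.1584, 4.1568))   # LaB6 a: a direct match, so no integer relation
```

Such a check would let an automated pipeline flag supercell or subcell solutions for review instead of counting them as plain failures.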