Melanie Vollmar, James M Parkhurst, Dominic Jaques, Arnaud Baslé, Garib N Murshudov, David G Waterman, Gwyndaf Evans.
Abstract
This study describes a method to estimate the likelihood of success in determining a macromolecular structure by X-ray crystallography and experimental single-wavelength anomalous dispersion (SAD) or multiple-wavelength anomalous dispersion (MAD) phasing based on initial data-processing statistics and sample crystal properties. Such a predictive tool can rapidly assess the usefulness of data and guide the collection of an optimal data set. The increase in data rates from modern macromolecular crystallography beamlines, together with a demand from users for real-time feedback, has led to pressure on computational resources and a need for smarter data handling. Statistical and machine-learning methods have been applied to construct a classifier that displays 95% accuracy for training and testing data sets compiled from 440 solved structures. Applying this classifier to new data achieved 79% accuracy. These scores already provide clear guidance as to the effective use of computing resources and offer a starting point for a personalized data-collection assistant. © Melanie Vollmar et al. 2020.
Keywords: X-ray crystallography; experimental phasing; machine learning; macromolecular crystallography; phasing; structure determination
Year: 2020 PMID: 32148861 PMCID: PMC7055369 DOI: 10.1107/S2052252520000895
Source DB: PubMed Journal: IUCrJ ISSN: 2052-2525 Impact factor: 4.769
Figure 1. Correlation matrix of Pearson’s correlation coefficients between feature pairs to identify linear correlations between them. All 44 features investigated have been plotted. Blue indicates positive linear correlation ranging from 0 to 1 and red indicates negative linear correlation ranging from −1 to 0. The intensity of the colour indicates the strength of the correlation. All numerical values can be found in Supplementary Table S3.
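As a rough illustration of how a correlation matrix of this kind can be computed, the following minimal sketch applies the textbook Pearson formula to every pair of features. The function names (`pearson`, `correlation_matrix`), the feature names and the toy values are hypothetical and are not taken from the study, which does not state here how its matrix was calculated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant feature has no defined linear correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(features):
    """Map every ordered feature pair to its Pearson coefficient."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}


# Hypothetical toy values for illustration only, not data from the study.
toy = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0],  # rises with feature_a
    "feature_c": [4.0, 3.0, 2.0, 1.0],  # falls as feature_a rises
}
matrix = correlation_matrix(toy)
```

In this toy input, `feature_a` and `feature_b` are perfectly positively correlated (the blue end of the colour scale in the figure) and `feature_a` and `feature_c` are perfectly negatively correlated (the red end).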
Figure 2. Bar plot of feature occurrences found during the initial classifier training. Features that are important in the decision-making process during classification appear more frequently regardless of which classifier has been used. The highest scoring features for the individual classifiers can be found in Supplementary Table S4. The most frequently found features are CCanom, ΔI/σI, manom, dmax, ΔF/F, f′′theor and CC1/2.
Figure 3. Confusion matrices and radar plots for a perfect classifier (a, b), for the best classifier, a decision tree with AdaBoost (c, d), and for the performance of the best classifier on new data (e, f). The confusion matrices (a, c, e) give the scores for the four possible classification outcomes: true negative at the top left, true positive at the bottom right, false negative at the top right and false positive at the bottom left. The perfect classifier has no misclassifications, whereas the decision tree with AdaBoost places three class ‘0’ samples and four class ‘1’ samples into the wrong category. For the new data, one sample has been identified as a false positive and four as false negatives. The classification outcomes serve as the basis for calculating the classification accuracy (ACC), classification error (Class Error), sensitivity (Sensitivity), specificity (Specificity), false-positive rate (FPR), precision (Precision) and F1 score (F1 score), which are plotted in the radar plots (b, d, f). The value ROC AUC is the area under the receiver operating characteristic (ROC) curve.
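The scores named in this caption can be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, on the assumption that the radar plots follow them; the function name `classification_metrics` and the example counts are hypothetical and are not results from the study. ROC AUC is omitted because it requires the classifier's ranked scores rather than the four counts alone.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive common binary-classification scores from confusion-matrix counts."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    accuracy = (tp + tn) / total
    return {
        "ACC": accuracy,
        "Class Error": 1.0 - accuracy,
        "Sensitivity": ratio(tp, tp + fn),   # true-positive rate
        "Specificity": ratio(tn, tn + fp),   # true-negative rate
        "FPR": ratio(fp, fp + tn),           # equals 1 - specificity
        "Precision": ratio(tp, tp + fp),
        "F1 score": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for illustration only, not values from the study.
scores = classification_metrics(tn=40, fp=10, fn=5, tp=45)
```

For a perfect classifier, `fp` and `fn` are both zero, so every score except the classification error and the false-positive rate equals 1.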
Figure 4. General workflow envisaged for an interactive user assistant. Blue depicts the different steps in structure solution from diffraction data collection to experimental phasing. Dark purple gives the feedback and statistics of every step, which is stored in the database METRIX_DB. Green represents the statistics stored in METRIX_DB which are used to train the classifier in METRIX_ML.