
The predictive power of data-processing statistics.

Melanie Vollmar¹, James M Parkhurst^1,2, Dominic Jaques¹, Arnaud Baslé³, Garib N Murshudov², David G Waterman^4,5, Gwyndaf Evans¹.

Abstract

This study describes a method to estimate the likelihood of success in determining a macromolecular structure by X-ray crystallography and experimental single-wavelength anomalous dispersion (SAD) or multiple-wavelength anomalous dispersion (MAD) phasing based on initial data-processing statistics and sample crystal properties. Such a predictive tool can rapidly assess the usefulness of data and guide the collection of an optimal data set. The increase in data rates from modern macromolecular crystallography beamlines, together with a demand from users for real-time feedback, has led to pressure on computational resources and a need for smarter data handling. Statistical and machine-learning methods have been applied to construct a classifier that displays 95% accuracy for training and testing data sets compiled from 440 solved structures. Applying this classifier to new data achieved 79% accuracy. These scores already provide clear guidance as to the effective use of computing resources and offer a starting point for a personalized data-collection assistant. © Melanie Vollmar et al. 2020.

Keywords: X-ray crystallography; experimental phasing; machine learning; macromolecular crystallography; phasing; structure determination

Year: 2020 PMID: 32148861 PMCID: PMC7055369 DOI: 10.1107/S2052252520000895

Source DB: PubMed Journal: IUCrJ ISSN: 2052-2525 Impact factor: 4.769

Introduction

Protein crystallography

For more than half a century, X-ray diffraction has been used to investigate protein crystals, and the resulting diffraction images have been analysed to reveal the underlying structure of the protein in atomic detail. Despite well established techniques and dedicated user facilities, the vast majority of recorded diffraction data do not yield a protein structure (http://biosync.sbkb.org). For example, it is estimated that in 2016 less than 7% of the diffraction data measured at European synchrotrons resulted in structures deposited in the Protein Data Bank (PDB; Berman et al., 2000). This calculation is based on an average collection time of 5 min per data set and assumes 200 operational days a year with 23 h of runtime per day. The possible factors affecting whether data lead to a structure deposition or not are manifold: (i) the crystal material comprising the purified protein and the additional chemicals used to crystallize it; (ii) the beamline hardware and capabilities, which define the experiments that can be carried out; (iii) the data-collection strategy, which is determined based on (i) and (ii); and (iv) intensity integration and assessment of the quality of the measured data as well as phase estimation, the latter finally determining whether a data set results in a structure or not. Each of these factors can be represented by one or more metrics, in particular those describing the protein and those derived from data analysis. Use of these metrics offers a unique opportunity to predict the usefulness of a given data set, i.e. whether or not it will result in an atomic structure. In this publication, we use machine learning and commonly applied statistical methods to analyse quality metrics from data analysis combined with protein sequence information.
This serves as a basis for developing an interactive user guide to help crystallographers assess their data sets in order to determine which should be put forward for full analysis and structure solution using experimental phasing for phase estimation (Drenth, 1999; Dauter et al., 2002; Blow & Rossmann, 1961; Blundell & Johnson, 1976). It is hoped that such a tool will enable structural biologists to better plan experiments and improve upon the estimated 7% success rate.

Machine learning

Machine learning is part of the field of artificial intelligence. It uses statistical methods to develop algorithms which allow a computer to ‘learn’ in a data-driven manner and make predictions based on the learned information (Kohavi & Provost, 1998). ‘Learning’ implies that a task or prediction has not been hard-coded by a programmer in advance (Bishop, 2006). The main purpose of machine learning is to identify patterns in given training data and to predict an outcome for any new data based on the learned pattern. The input data are usually held in a database, here METRIX_DB, and can be extracted in a tabular fashion, with columns and their headers giving the characteristics/features/dimensions of the data and each row representing a sample. Commonly, the data are split randomly into training and test sets, with the former being used to train a machine-learning algorithm and the latter being used to assess the performance of the finalized, trained model. Generally, k-fold cross-validation against the training set is performed to highlight any overfitting, which is monitored through classification accuracy. In supervised learning, the data have been annotated with labels of the known result, here representing two classes in a classification problem, and an equal distribution of class sizes is desirable. A confusion matrix is used for performance assessment, giving details about correctly identified positive (true positive; TP) and negative (true negative; TN) samples as well as wrong classifications (false positive, FP; false negative, FN). These classification outcomes are the basis on which to calculate additional metrics (classification accuracy and error, sensitivity, specificity, false-positive rate, precision, F1 score). Additionally, the area under the receiver operating characteristic (ROC) curve is calculated.
A classification error of 5% is often used as a benchmark, as this is the typically observed human classification performance (Dodge & Karam, 2017). In a pre-assessment step the most important features in decision making are identified using statistical tools such as Pearson’s linear correlation coefficients and recursive feature elimination. The use of this subset of features for classifier training improves the stability and performance of the classifier and reduces computation time (Pyle, 1999; Pang-Ning et al., 2006; Guyon & Elisseeff, 2003). Training a classifier to create a predictive model is then an iterative process of training, testing and assessment until the desired stability and performance are reached. In this case study, we focused on supervised learning to solve a binary classification problem, namely the likelihood of experimental phasing success (class label ‘1’ or positive) or failure (class label ‘0’ or negative). The algorithms used to create trained models are decision trees, random forest classifiers and their derivatives (Breiman et al., 1984), and support vector machines (SVMs; Cortes & Vapnik, 1995).

Methods

METRIX_DB database

For the project described here, a database called METRIX_DB was created using SQLite3, accessed through the sqlite3 module of the Python standard library. At the time of writing, the database held the details of 810 released PDB structures. The diffraction images for these structures have been curated to match the set that was used to determine the published three-dimensional coordinates. At the moment, these structures stem from two structural genomics projects: 303 from the Structural Genomics Consortium (SGC; https://www.thesgc.org) at Oxford University, England and 507 from the Joint Center for Structural Genomics (JCSG; http://www.jcsg.org) at Stanford Synchrotron Radiation Lightsource, USA. We acknowledge that by using structures from two major laboratories, their distribution may not be entirely representative of the PDB. For 364 of these structures the data were collected as ‘native’ and for 446 the data were collected as an anomalous MAD or SAD experiment. The data were acquired at both synchrotron and in-house facilities and therefore also cover a range of detectors, i.e. photon-counting and CCD cameras, as well as X-ray sources. The resolution for the structures ranges from 1.05 to 3.8 Å; soluble and membrane proteins are covered as well as proteins in complexes with other proteins, peptides or nucleic acids. The anomalous scatterer used in experimental phasing was introduced by means of protein production in a selenium-enriched medium to create selenomethionine (SeMet) in nearly all cases. The metadata for these 810 structures were retrieved from the published PDB files and stored in METRIX_DB. Additional information was created when carrying out data integration and reduction, experimental phasing and sequence analysis. Where multiple wavelengths were available for a structure, the data set for each wavelength is considered a separate sample.
After all of the data had been collected in the different tables of the database, individual columns containing the information of interest were selected and combined into a new table and exported as a file of comma-separated values which could directly be used in machine learning. The code for the database can be found at https://github.com/ccp4/metrix-database.
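The export step described above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the table names (`data_reduction`, `sequence`), the column names and the inserted values are hypothetical and are not taken from the METRIX_DB schema.

```python
import csv
import io
import sqlite3

# Hypothetical two-table layout; the real METRIX_DB schema is not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_reduction (sample_id INTEGER PRIMARY KEY, cc_anom REAL, d_max REAL);
    CREATE TABLE sequence (sample_id INTEGER PRIMARY KEY, n_met INTEGER);
    INSERT INTO data_reduction VALUES (1, 0.45, 38.2), (2, 0.05, 52.0);
    INSERT INTO sequence VALUES (1, 6), (2, 3);
    """
)

# Select the columns of interest from each table and join them per sample.
query = """
    SELECT d.sample_id AS sample_id, d.cc_anom AS cc_anom,
           d.d_max AS d_max, s.n_met AS n_met
    FROM data_reduction AS d
    JOIN sequence AS s ON s.sample_id = d.sample_id
    ORDER BY d.sample_id
"""
cursor = conn.execute(query)
header = [column[0] for column in cursor.description]

# Write the combined table as comma-separated values (to a string buffer here).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)
csv_text = buffer.getvalue()
conn.close()

print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; replacing `io.StringIO()` with an opened file would produce a CSV file that a machine-learning script could read directly.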

Data-reduction and phasing pipeline

Many of the details about the various data sets to be used in machine learning are statistics created during data reduction and phasing. Rather than executing these computational steps in a serial manner, a processing framework has been created using the Python 2.7 programming language to streamline the process using a compute cluster. A series of scripts has been developed to run xia2 (Winter, 2010) using DIALS (Winter et al., 2018) and AIMLESS (Evans, 2006) for diffraction-image integration and data reduction. The statistics recorded in METRIX_DB are averages over the entire resolution range for a data set of a given wavelength. Although it is recognized (Usón & Sheldrick, 2018) that experimental phasing success can be sensitive to the high-resolution cutoff used, we chose not to investigate the resolution-dependence of the quality metrics included here owing to the increase of complexity for this proof-of-principle study. Only samples for which data reduction was successful were taken forward into experimental phasing. For experimental phasing the SHELXC/D/E pipeline (Sheldrick, 2010) was used. If one wavelength was identified during data reduction, even if the data were collected as ‘native’, then a SAD experiment was assumed and phasing was carried out accordingly. If more than one wavelength was identified, the data were phased as a MAD experiment. Only samples for which the phasing software exited without error were used for machine learning and assigned a label, either ‘1’ or ‘0’, respectively, depending on whether the protein backbone could be traced or not. ‘Native’ data were not automatically assigned the label ‘0’, as several examples originally phased through molecular replacement also exhibited an anomalous signal strong enough for experimental phasing owing to intrinsic metals, for example in the active site.
With the exception of a hold-out set used to calibrate the best classifier, the concession was made to not check data-reduction and phasing output in depth, or optimize input parameters for each structure, in order to be able to run computations on a computational cluster and hence in a time-efficient manner. A total of 703 samples were used for training and testing the classifiers, and a further 34 for calibration before predicting with new samples.

Protein

The sequence of each protein published alongside the structure was retrieved from the PDB and used for various calculations. For each sequence the molecular weight and number of atoms were calculated and stored in METRIX_DB. Using the unit-cell dimensions, molecular weight and the MATTHEWS_COEF tool from the CCP4 suite (Winn et al., 2011; Matthews, 1968), the most likely number of molecules in the asymmetric unit was determined as well as the unit-cell volume and the solvent content. The number of anomalous scatterers expected to be present in the structure was determined by counting the methionines in the sequence and multiplying this count by the number of molecules most likely to be found in the asymmetric unit. Overall, this gave reasonably good estimates for most samples, but did fail in cases of proteins in complexes and a few cases in which the anomalous scatterer was not selenium.
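The scatterer estimate described above amounts to a count and a multiplication. A minimal sketch follows; the function name and the example sequence are hypothetical, and the number of molecules in the asymmetric unit is assumed to be supplied from a separate Matthews-coefficient calculation.

```python
def expected_anomalous_scatterers(sequence: str, molecules_in_asu: int) -> int:
    """Estimate the number of Se sites for a SeMet-labelled protein.

    Counts methionine residues (one-letter code 'M') in the sequence and
    multiplies by the assumed number of molecules in the asymmetric unit.
    """
    methionines = sequence.upper().count("M")
    return methionines * molecules_in_asu


# Hypothetical 24-residue sequence containing three methionines.
example_sequence = "MKTAYIAKQRMQISFVKSHFSRQM"
n_sites = expected_anomalous_scatterers(example_sequence, molecules_in_asu=2)
print(n_sites)
```

As the text notes, an estimate of this kind is expected to be off for complexes and for data sets in which the anomalous scatterer is not selenium.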

New test data

The data used in this challenge were provided by the protein crystallography group at the University of Newcastle, England. None of the proteins analysed were present in the training or testing data. For 12 samples, the data collections were carried out on beamlines I03, I04, I04-1 and I24 at Diamond Light Source using a PILATUS detector. Data measured with this type of detector were available in the training and test sets. A further 12 samples were from a recent data collection on I04 using its new hardware setup of an EIGER detector and a multi-axis goniometer, for which no data were available in the training and test sets. The new diffraction data were integrated in the same way as the training and test data.

The machine-learning aspect of this publication is based on Python 3.6. Other packages used are pandas 0.23.0 (McKinney, 2010), Matplotlib 2.2.3 (Hunter, 2007), SciPy 1.1.0 (Oliphant, 2007), mlxtend 0.13.0 (Raschka, 2018), scikit-learn 0.20.0 (Pedregosa et al., 2011) and NumPy 1.14.3 (Oliphant, 2006). The code for the machine-learning component of this publication can be found at https://github.com/ccp4/metrix_ml.

To ensure that the performance of a classifier is not biased to one particular class, it is important to have a balanced data set in which both classes are present equally. This also needs to be considered when splitting the data into training and testing sets, as the dominant class is likely to be more frequently found and will therefore skew the performance of any classifier to only be able to predict this class. However, in our case a balance between the classes was not achievable. Therefore, the split into training and testing data was stratified to ensure that the two sets are representative of the data, meaning that they maintain the class distribution.
Here, we used a common split of 20% of data being assigned to the testing set and 80% remaining in a training set, while at the same time maintaining a class distribution of 66% for class ‘1’ and 33% for class ‘0’. A random seed has also been defined to ensure reproducibility when splitting the data in subsequent executions. Additionally, some classifiers allow weights to be assigned to the different classes to achieve a balance, which will be explored when carrying out a randomized search for the best hyperparameters (see below). This also applies to the hold-out set used to calibrate the classifier before prediction.

Overfitting means that the algorithm learns the training data and predicts these cases very well, apparently producing very good performance measures. However, challenged with a new, unseen sample the algorithm performs badly and fails to generalize. k-fold cross-validation for the training set was used to address this problem. Crucially, if the class distribution is unbalanced then this needs to be reflected in the cross-validation folds as well. In this study, we used a threefold cross-validation. The testing set is only used for assessing the trained, hyperparameter-optimized model at the very end.

For support vector machines, the data were standardized using the StandardScaler class from the pre-processing module of scikit-learn to scale to unit variance. For decision trees and random forest algorithms, however, standardization is not necessary.

A full list of all 44 features investigated here is given in Supplementary Table S1. This also includes the custom column transformations discussed in Appendix A. All features investigated here were plotted against each other and their linear Pearson’s correlation coefficients, r, were determined using the corr() function in pandas. Any correlation coefficient which has an associated p-value of <0.05 can be considered to be meaningful and a correlation can be identified.
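The stratified 80/20 split with a fixed seed and the stratified threefold cross-validation described above can be sketched with scikit-learn as follows. The feature matrix is random placeholder data, the seed value is arbitrary and only the class counts (471 and 232) come from the study; this is an illustration, not the authors' script.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

SEED = 42  # arbitrary seed, fixed so that repeated runs give the same split

# Placeholder data: 471 class '1' and 232 class '0' samples, six random features.
rng = np.random.default_rng(SEED)
y = np.array([1] * 471 + [0] * 232)
X = rng.normal(size=(y.size, 6))

# 80/20 split that preserves the class distribution in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Threefold cross-validation on the training set, again stratified by class.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
fold_fractions = [
    y_train[val_idx].mean() for _, val_idx in skf.split(X_train, y_train)
]

print(len(y_train), len(y_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
print([round(f, 3) for f in fold_fractions])
```

The printed fractions of class ‘1’ should stay close to the overall value in the training set, the test set and every validation fold, which is the point of stratifying.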
The results were visualized in a correlation matrix with negative correlations (−1 to 0) coloured red and positive correlations (0 to 1) coloured blue. To quantify the correlation strength between two variables a coefficient of determination, r², was calculated. This gives the variance in the data of one dependent variable explained by the independent variable in the pair. An r² of >10% therefore indicates a weak correlation and an r² of >90% indicates a strong correlation. Only correlations that fulfil p < 0.05 and r² > 10% will be considered here.

For support vector machines feature importances are not readily available. Recursive feature elimination was therefore used to identify the most likely number of features as well as those most important for decision making. This was monitored through changes in classification accuracy upon the withdrawal of a feature. The recursive feature-elimination function with cross-validation from the feature_selection module in scikit-learn was used for this assessment.

For all classifiers the hyperparameters used have been determined in a randomized search using the RandomizedSearchCV function in scikit-learn, which tried 500 combinations for a given range. A basic scheme of training and assessment can be found in Supplementary Fig. S1. The following classifiers have been investigated for their suitability as a predictive tool for the available data: a support vector machine with a linear kernel and a radial basis function kernel, a decision tree, a decision tree with bagging, a decision tree with AdaBoost, a random forest and an extreme randomized forest. More details of the hyperparameter settings for the individual classifiers can be found in Appendix B. The hyperparameters used for the best classifier after identifying the most important features are given in Supplementary Table S2. The metrics used to assess the different classifiers are detailed in Appendix C.
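The correlation screening described earlier in this section (Pearson's r, its p-value and the coefficient of determination r², with cutoffs of p < 0.05 and r² > 10%) can be sketched as follows. Note that the pandas corr() function returns coefficients only, so this sketch assumes that p-values are obtained separately, here with scipy.stats.pearsonr. The feature names and synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr


def screen_correlations(features, p_max=0.05, r2_min=0.10):
    """Return feature pairs whose Pearson correlation passes both cutoffs.

    `features` maps a feature name to a 1D array of values. Each result is
    (name_a, name_b, r, r_squared, p_value).
    """
    names = list(features)
    kept = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            r, p_value = pearsonr(features[name_a], features[name_b])
            r_squared = r ** 2
            if p_value < p_max and r_squared > r2_min:
                kept.append((name_a, name_b, r, r_squared, p_value))
    return kept


# Synthetic example: 'b' depends linearly on 'a'; 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = {
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
}
pairs = screen_correlations(data)
for name_a, name_b, r, r_squared, p_value in pairs:
    print(name_a, name_b, round(r, 2), round(r_squared, 2), p_value < 0.05)
```

Applying both cutoffs together discards pairs that are statistically significant but explain very little variance, which can happen readily with several hundred samples.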
As mentioned previously, the data used in this study are unbalanced regarding class distributions. Although this distribution was maintained by using a stratified split when separating the test and training sets, this imbalance still had an effect on how reliable the predicted class probabilities were. In order to address this problem, the best classifier was calibrated with a hold-out set that was part of neither the test nor the training set, using the CalibratedClassifierCV class in scikit-learn with cv=‘prefit’. After calibrating the best classifier, the predict() function from the scikit-learn package was used to challenge it with new samples.
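A minimal sketch of calibrating an already trained classifier on a hold-out set is given below. It rests on several assumptions that are not taken from the study: the data are synthetic, the AdaBoost settings are arbitrary and sigmoid calibration is chosen for illustration. Recent scikit-learn releases deprecate cv='prefit' in favour of wrapping the fitted model in FrozenEstimator, so the sketch tries that route first and falls back to cv='prefit' on older versions. Applying an explicit probability cutoff to predict_proba() output is also an assumption about how a class ‘1’ threshold might be imposed.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data with six features (placeholder only).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    weights=[0.33, 0.67], random_state=0,
)
# Three disjoint parts: model fitting, calibration hold-out, "new" samples.
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_cal, X_new, y_cal, y_new = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Decision tree with AdaBoost; hyperparameters here are arbitrary.
clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=0
)
clf.fit(X_fit, y_fit)

# Calibrate the fitted model on the hold-out set without refitting it.
try:
    from sklearn.frozen import FrozenEstimator  # newer scikit-learn
    calibrated = CalibratedClassifierCV(FrozenEstimator(clf), method="sigmoid")
except ImportError:  # older scikit-learn, as in the text
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
calibrated.fit(X_cal, y_cal)

# Label a sample as class '1' only if its calibrated probability is >= 80%.
THRESHOLD = 0.80
proba_class1 = calibrated.predict_proba(X_new)[:, 1]
predicted = (proba_class1 >= THRESHOLD).astype(int)
print(predicted.sum(), "of", len(predicted), "samples predicted as class 1")
```

Raising the cutoff above 0.5 trades sensitivity for a lower false-positive rate.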

Comparison with an existing SAD prediction tool

The software tool plan_SAD_experiment (Terwilliger et al., 2016a,b) from the Phenix analysis package (version 1.17.1; Liebschner et al., 2019) was used to calculate the probability of success for a given wavelength, assuming that each wavelength represents a SAD data set. The tool was executed with the following parameters: phenix.plan_sad_experiment seq_file= atom_type=Se wavelength= resolution= data=, where wavelength and resolution have been queried from METRIX_DB, and mtz_file contains the integrated intensities for each wavelength of a given PDB entry.

Results

Based on the results of automated analysis, 440 structures successfully produced 703 samples, each of which is a crystallographic data set at a single wavelength. The class labels for these samples were verified through manual inspection of the automatic processing results, with 232 identified as class ‘0’ and 471 as class ‘1’. Information is held in METRIX_DB as a collection of tables, with each table relating to a stage of crystallographic data analysis, for example sequence details, data reduction, experimental phasing and the deposited PDB file information for reference. METRIX_DB is structured such that the number of samples and features to be investigated can easily be expanded. Measures have been put in place to expand the database in the future, with the aim of ultimately using the results from the Synchweb/ISPyB user interface (Fisher et al., 2015), which is used to manage data-collection results during a visit.

Pre-assessment of the data

Fig. 1 shows a correlation matrix for the linear Pearson’s correlation coefficients determined for all features. The features are grouped such that correlations between descriptors of diffraction data quality are concentrated in the top left quadrant and correlations between protein descriptors are located in the bottom right quadrant. The corresponding correlation coefficients (r), associated p-values and coefficients of determination (r²) as quantitative measures of the correlation strength are reported in Supplementary Table S3.

Figure 1

Correlation matrix of Pearson’s correlation coefficients between feature pairs to identify linear correlations between them. All 44 features investigated have been plotted. Blue indicates positive linear correlation ranging from 0 to 1 and red indicates negative linear correlation ranging from −1 to 0. The intensity of the colour indicates the strength of the correlation. All numerical values can be found in Supplementary Table S3.

The highest scoring features for three decision trees, two random forests and the linear SVM classifiers are reported in Supplementary Table S4, with the corresponding performance statistics on the test set given in Supplementary Table S5. Features related to and extracted from experimental phasing software, such as CCweak, CCall and CFOM, were used in initial, exploratory work. These features were so dominant that the information provided from data integration and scaling statistics would vanish. However, the purpose of this study was to identify indicators for experimental phasing success at the step of data integration and scaling, so experimental phasing statistics were excluded from further analysis. The most important features in the decision-making process in the different classifiers and the frequency of appearance of a particular feature were counted and plotted in Fig. 2. For retraining the classifiers, the six highest scoring features were chosen: CCanom, ΔI/σI, manom, dmax, ΔF/F and f′′theor. Smaller and larger feature sets based on the scores plotted in Fig. 2, using one feature (CCanom), two features (CCanom, manom; CCanom, I/σ), five features (CCanom, manom, dmax, ΔF/F, f′′theor) and seven features (CCanom, ΔI/σI, manom, dmax, ΔF/F, f′′theor, CC1/2), were also tried (data not shown), but none of the resulting classifiers performed as well as the decision tree with AdaBoost and the six highest scoring features.

Figure 2

Bar plot of feature occurrences found during the initial classifier training. Features that are important in the decision-making process during classification appear more frequently regardless of which classifier has been used. The highest scoring features for the individual classifiers can be found in Supplementary Table S4. The most frequently found features are CCanom, ΔI/σI, manom, dmax, ΔF/F, f′′theor and CC1/2.

Feature correlations

A closer look at the correlation matrix identifies high degrees of positive or negative correlation between the data-quality descriptors commonly used by crystallographers. These statistics assess the precision of either unmerged (Rmerge, Rmeas) or merged (I/σ, CC1/2, Rp.i.m.) intensities. The patterns of correlations for Rmerge and Rmeas are very similar, supporting the view that the introduction of Rmeas renders Rmerge obsolete (Weiss & Hilgenfeld, 1997). For the merged intensity precision indicators, there are broadly similar patterns (with the sign of the correlation coefficient inverted for Rp.i.m. compared with I/σ or CC1/2), but the differences between these patterns suggest that distinct information is expressed by these metrics. The relations between these quantities have been discussed elsewhere (Karplus & Diederichs, 2015). It should be mentioned that the spread of multiplicity, M, in METRIX_DB is limited and therefore a relationship with I/σ could not be explored without artificial truncation of data sets, which we did not perform. Indicators of anomalous signal in the data (ΔI/σI, ΔF/F, CCanom and manom) are only weakly correlated to the theoretical anomalous scattering factor f′′theor, presumably because of other factors influencing the observed signal such as anomalous scatterer occupancy and B factor, and the overall signal-to-noise ratio in the data. The anomalous signal as expressed by ΔF/F is clearly reflected in strong correlations to Rmeas(I), Rmerge(I) and Rp.i.m.(I) calculated assuming intensity equivalence of Bijvoet mates, i.e. higher values when compared with Rmerge(I+/I−), Rmeas(I+/I−) and Rp.i.m.(I+/I−) that account for the presence of anomalous signal. Other commonly used metrics, such as Nobs(total), Nobs(unique), dmin and B, show lower level correlations to data-quality descriptors.
dmax, however, is typically defined by a backstop shadow rather than the intrinsic quality of the measured intensities and is uncorrelated to data-quality descriptors. Data completeness, T, is given here by a single value, which ignores the potentially relevant effect of systematic patterns of incompleteness, such as missing wedges, shadowed regions of the detector and a variable high-resolution cutoff at different regions of the detector. More detailed analysis such as this will require more sophisticated descriptors of completeness and a more representative database. There are weak correlations between descriptors of the protein content and data-quality indicators, for example MWSASU, which gives the ratio between molecular weight and the number of anomalous scatterers in the asymmetric unit, or IASU, which represents the ratio between signal (I/σ) and asymmetric unit content (MWASU). As METRIX_DB expands it will be interesting to explore these relationships, but at this stage we avoid speculative interpretation. The pattern of correlations within the group of protein-content descriptors shows some larger features, as expected for metrics that are all used to quantify various aspects of the crystal content. Unit-cell parameters are weakly correlated with various parameters but display no strong predictive properties. Also visible, as one would expect, are correlations between space-group number (Nsg), multiplicity (M) and I/σ through its relation to multiplicity.

Selecting the best-performing classifier

The reduced feature set identified above was used to retrain all classifiers, and their performance results on the test set are given in Supplementary Table S6. The best-performing classifier was a decision tree with AdaBoost, and its confusion matrix and radar plot are shown in Figs. 3(c) and 3(d), respectively. Additionally, the results for a perfect classifier are given for comparison [Figs. 3(a) and 3(b)].

Figure 3

Confusion matrices and radar plots for a perfect classifier (a, b), for the best classifier, a decision tree with AdaBoost, on the test set (c, d) and for the same classifier on new data (e, f). The confusion matrices (a, c, e) give the scores for the four possible classification outcomes: true negative at the top left, true positive at the bottom right, false negative at the top right and false positive at the bottom left. The perfect classifier has no misclassifications, whereas the decision tree with AdaBoost places three class ‘0’ samples and four class ‘1’ samples into the wrong category. For the new data one sample has been identified as a false positive and four as false negatives. The classification outcomes serve as a basis to calculate classification accuracy (ACC), classification error (Class Error), sensitivity (Sensitivity), specificity (Specificity), false-positive rate (FPR), precision (Precision) and F1 score (F1 score) as they are plotted in the radar plots (b, d, f). The value ROC AUC is determined by calculating the area under the ROC curve.

This classifier is the best-performing classifier based on the assessment metrics used, achieving a classification accuracy of 95%. The sensitivity, or true-positive rate, was found to be 96% (90 out of 94 samples) and the specificity, or true-negative rate, was 94% (44 out of 47 samples). The false-positive rate was 6% (three out of 47 samples) with a precision of 97%. The F1 score was 96% and the area under the ROC curve was 99%.
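The figures quoted above follow directly from the four confusion-matrix counts on the test set (90 true positives, 4 false negatives, 44 true negatives and 3 false positives). The sketch below recomputes them using the standard definitions; the function name is hypothetical, and the ROC AUC is omitted because it requires the per-sample probabilities rather than the counts.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Test-set counts for the decision tree with AdaBoost, as reported in the text.
metrics = classification_metrics(tp=90, tn=44, fp=3, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.0f}%")
```

Rounded to whole percentages, these values reproduce the reported accuracy (95%), sensitivity (96%), specificity (94%), false-positive rate (6%), precision (97%) and F1 score (96%).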

Testing the prediction classifier against new data

The performance metric results for the new data using the decision tree with AdaBoost are given in Supplementary Table S7 and the corresponding confusion matrix and radar plot in Figs. 3(e) and 3(f). A total of 24 new samples were used to challenge the classifier. The samples comprised proteins not present in METRIX_DB, and the data-collection strategies and beamline hardware were entirely different to those used for the training data. These therefore presented a significant challenge for the classifier. The experimental outcomes for these new samples were assessed and labelled by the user and were only revealed after prediction had been carried out. A probability threshold of 80% for class ‘1’ was applied, reflecting the fact that users would typically prefer a low false-positive rate, i.e. have some confidence that class ‘1’ truly reflects a successful structure determination. The classification accuracy achieved was 79%. Sensitivity and specificity were 64% (seven out of 11 samples) and 92% (12 out of 13 samples), respectively. The false-positive rate was 8% (one out of 13 samples) with a precision of 86% and an F1 score of 74%, and the area under the receiver operating characteristic curve (ROC AUC) was determined as 75%.

The same samples, a total of 703, that were used as training and testing sets for machine learning were analysed by phenix.plan_SAD_experiment. A probability threshold of 80% for SAD phasing success was chosen as a measure of confidence in the prediction, as was performed for the new user sample (see Section 3.5). Overall, the tool achieved a classification accuracy of 68%. The vast majority of true-positive samples were correctly identified by the prediction tool, with a sensitivity of 97%. Many of the false-negative samples had a wavelength chosen for low-energy remote data collection as part of a MAD data set where there is weak or no anomalous signal but that was essential to solve the phase problem.
Of the true-negative samples, 21 were correctly identified, which is reflected in a false-positive rate of 91%. In comparison, the false-positive rate for our testing set is 6% and that for new user data is 8%. We stress that the results from the two approaches are not directly comparable as the tools are intended for different purposes. phenix.plan_SAD_experiment was designed to advise a user who has already chosen to try SAD phasing whether they are going to be successful, which it does very well based on the sensitivity of 97%. However, its purpose is not to identify data sets for which data were collected either as native or MAD, hence the false-positive rate of 91%.

Discussion

Analysing crystallographic results with the aim of predicting the likely experimental phasing success using machine learning is a data-driven approach. As such, the outcome is defined by the kind of data that have been used in training. In this study, we have chosen to focus on particular experimental phasing approaches represented by a training database of native, SAD and MAD data sets. The content of METRIX_DB is currently limited to published structures where data are publicly available. Nearly all of the crystallographic data used here exhibited anomalous signal that made experimental phasing straightforward. A post-mortem analysis of a collection of weak S-SAD data sets is under way with the aim of including such data in METRIX_DB. Ultimately, representative data from each kind of data collection performed by users need to be included in METRIX_DB. This should reduce the constraints currently imposed on the content of METRIX_DB and therefore on the scope of our studies. Additionally, this would close the technology gap between the data currently measured on modern X-ray beamlines and those contained in METRIX_DB. For example, current synchrotron data sets are almost exclusively measured using photon-counting hybrid pixel detectors and fine-slicing methods. However, our analysis clearly provides initial insight into the potential application of machine learning in protein crystallography to assist a scientist during decision making in experimental phasing. For future investigations METRIX_DB will be expanded to make use of other descriptors, for example, results from analysis and prediction tools making use of the protein sequence. Furthermore, recent changes in data policy for many European synchrotrons will allow user data to be incorporated into training databases, making them more relevant and effective.
Clearly, the highest scoring features identified here, ΔI/σI, ΔF/F, CCanom, manom, dmax and f′′theor, should be optimized by a crystallographer prior to, or during, data collection and analysis, whether or not a classifier is being used to provide guidance. For example, to maximize f′′theor a wavelength scan should be carried out prior to data collection to select an optimal wavelength. To optimize dmax an additional low-resolution pass could be collected and/or the beamstop size and position could be set to ensure low-resolution data coverage. Regarding ΔI/σI, ΔF/F, CCanom and manom it would be advisable to look at the classifier prediction for experimental phasing success and continue to collect additional rotation images in order to increase anomalous signal while monitoring radiation damage. Alternatively, data collected from several crystals of the same protein with the same anomalous scatterer can be combined.
The matrix of Pearson’s linear correlation coefficients showed that a subset of data-quality metrics [Rmerge, Rmeas, Rp.i.m., Rmerge(I+/I−), Rmeas(I+/I−) and Rp.i.m.(I+/I−)] are highly correlated with each other and hence convey very similar information. This gives additional support to previously published analyses (Karplus & Diederichs, 2012; Diederichs & Karplus, 2013; Evans & Murshudov, 2013) describing the relationships between these metrics.
Conducting an in-depth analysis of the resolution dependence of many of the metrics investigated here, in particular those identified as being most important when judging the likelihood of experimental phasing success using a machine-learning tool, was not within the scope of this manuscript but will be part of further work. In our study, all of the statistics were averages across the entire resolution range of a given sample, where the high-resolution limit was set by the data-integration software.
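The correlation analysis of data-quality metrics mentioned above can be illustrated with a small, self-contained sketch of a Pearson correlation matrix. The helper names `pearson` and `correlation_matrix`, the metric keys and all numerical values are assumptions made up for illustration; they are not taken from METRIX_DB.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the pairwise correlations of all named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up merging statistics for five hypothetical data sets.
metrics = {
    "Rmerge": [0.05, 0.08, 0.12, 0.20, 0.31],
    "Rmeas": [0.06, 0.09, 0.14, 0.23, 0.35],
    "Rpim": [0.02, 0.03, 0.05, 0.08, 0.12],
}
matrix = correlation_matrix(metrics)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Coefficients close to 1 between two metrics indicate that they carry largely redundant information, which is the situation described for the merging R factors.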
Generally, phasing techniques do not use the full resolution range of data but typically truncate the data to a lower resolution limit for substructure determination. Using a systematic approach by applying common cutoffs to all data sets would therefore be useful in identifying the resolution range that gives the highest chance of success, regardless of the actual resolution limit of the data.
The best classifier as judged by its performance metrics presented here is a decision tree with AdaBoost. With a classification error of 5%, this classifier performs at about the same level as a human would when presented with the test data (Dodge & Karam, 2017). The small number of false positives and false negatives, four and three, respectively, should allow the classifier to generalize when challenged with a novel sample. This is further detailed below.
Many of the data sets present in the training and test sets were measured with CCD and photon-counting hybrid pixel-array detectors (PADs) using typical crystal rotation ranges per image of >0.5° (wide-slicing). The 24 data sets used to test the classifier were, however, measured at Diamond Light Source on PADs using fine-slicing (typically <0.1°). This difference in data-collection approach may be one reason for the lower classification accuracy of 79%. In general, however, one would always expect a reduction in accuracy for any classifier when used with new data. Surprisingly, perhaps, the classifier performs well in correctly identifying samples where experimental phasing is likely to fail, class ‘0’, with a specificity of 92%. A broader representation of detector types and data-collection strategies in METRIX_DB would be likely to result in better classifier performance against new data and could highlight other high-scoring features to optimize.
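As an illustration of the boosting principle behind the classifier type named above, the sketch below implements discrete AdaBoost with depth-one decision trees (stumps) in plain Python. This is a simplified stand-in and not the implementation or configuration used in this study; the function names, the three boosting rounds and the one-dimensional toy data are all assumptions, with labels +1 and -1 playing the roles of class ‘1’ and class ‘0’.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Depth-one tree: return `polarity` at or above the threshold, otherwise its negative."""
    return polarity if row[feature] >= threshold else -polarity


def train_stump(X, y, weights):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(w for row, label, w in zip(X, y, weights)
                            if stump_predict(row, feature, threshold, polarity) != label)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost(X, y, n_rounds):
    """Train a weighted ensemble of stumps; returns (alpha, feature, threshold, polarity) tuples."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)    # vote of this stump
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, feature, threshold, polarity)
                for alpha, feature, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data (not crystallographic): no single threshold separates the two classes.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(X, y, n_rounds=3)
predictions = [predict(model, row) for row in X]
print("training accuracy:", sum(p == t for p, t in zip(predictions, y)) / len(y))
```

Each individual stump misclassifies part of the toy data, yet the weighted vote of three stumps reproduces all training labels, which is the effect boosting is meant to achieve.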
Similarly, the use of different anomalous scatterers in the diffraction experiment needs to be considered since all derivatized training samples here were selenomethionine proteins. Samples 9 to 12 of the new user data were heavy-atom soaks using platinum, gold or lead compounds. Samples 9 and 10 were correctly classified as class ‘0’, since a lack of anomalous signal meant experimental phasing failed. This was probably owing to poor incorporation of the heavy atoms during soaking. Of the remaining two samples, one was classified correctly as class ‘1’ (sample 11).
Although attempted, a direct comparison with the already available phenix.plan_SAD_experiment is not justified as this tool was specifically designed to help crystallographers maximize their chances of solving the phase problem with a SAD experiment. In contrast, our machine-learning approach is designed to work in a more general way by looking at the data measured for different phasing methods.
An implementation of an interactive user guide can be envisaged in which the classifier is trained with standard data sets, makes predictions on incoming data collections and reports results to the user. Feedback can be given directly through the Synchweb/ISPyB interface as part of the general data-analysis workflow. After assessing their results, either while still at the beamline or later after more careful analysis, users annotate the data through Synchweb/ISPyB with the actual experimental outcome by simply clicking on a box to set a label. These data would then be included in METRIX_DB or extracted directly from Synchweb/ISPyB to retrain the classifier. The retraining process itself would be carried out during shutdown periods when no new user data are acquired or between visits, depending on computational resources. Over time, such a classifier would be customized towards the proteins investigated by a certain user group and their typical data-collection and experimental phasing strategies.
The classifier would then become more stable and the retraining frequency could be reduced. The flowchart in Fig. 4 gives a schematic outline for an interactive user system. Although a user will always be able to ignore the recommendations and trigger data analysis manually, including our trained algorithm in the analysis pipelines is expected to help in balancing the workload on the computing infrastructure in a more intelligent way than the brute-force approach currently in use.
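The envisaged cycle of prediction, user annotation and retraining could be organized along the lines of the following sketch. Everything in it is hypothetical: the `Record` and `Assistant` classes, the action strings, the `anomalous_signal` feature and the threshold rule standing in for a trained classifier are assumptions for illustration only and do not describe Synchweb/ISPyB, METRIX_DB or METRIX_ML.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]


@dataclass
class Record:
    name: str
    features: Features
    predicted: int                     # classifier output: 1 = phasing likely to succeed
    user_label: Optional[int] = None   # actual outcome, annotated later by the user


@dataclass
class Assistant:
    classify: Callable[[Features], int]
    records: List[Record] = field(default_factory=list)

    def triage(self, name: str, features: Features) -> str:
        """Predict the class of an incoming data set and recommend an action."""
        record = Record(name, features, self.classify(features))
        self.records.append(record)
        return "run phasing" if record.predicted == 1 else "skip phasing"

    def annotate(self, name: str, label: int) -> None:
        """Store the experimental outcome reported by the user."""
        for record in self.records:
            if record.name == name:
                record.user_label = label

    def training_set(self) -> List[Tuple[Features, int]]:
        """Return only the annotated records, ready for the next retraining run."""
        return [(r.features, r.user_label) for r in self.records
                if r.user_label is not None]


def toy_classifier(features: Features) -> int:
    """Placeholder rule (hypothetical threshold), NOT a trained classifier."""
    return 1 if features["anomalous_signal"] > 1.0 else 0


assistant = Assistant(toy_classifier)
actions = [assistant.triage("set_A", {"anomalous_signal": 1.4}),
           assistant.triage("set_B", {"anomalous_signal": 0.6})]
assistant.annotate("set_A", 1)  # the user confirms that phasing succeeded
print(actions, assistant.training_set())
```

Keeping predictions and user labels in the same record makes it straightforward to select annotated samples for retraining while leaving the user free to override any recommendation.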

Figure 4

General workflow envisaged for an interactive user assistant. Blue depicts the different steps in structure solution from diffraction data collection to experimental phasing. Dark purple gives the feedback and statistics of every step, which is stored in the database METRIX_DB. Green represents the statistics stored in METRIX_DB which are used to train the classifier in METRIX_ML.

Additionally, we envision a system in which a classifier executes repeated predictions on incomplete data while data collection is still ongoing to indicate a trend of success and to identify the point at which the data are sufficient to attempt experimental phasing. This would be very beneficial, for example, in the case where multiple partial rotation data sets are being collected and combined. Post-mortem analysis regarding such an application is under way using S-SAD data.
We have presented a proof of principle for how machine learning can be used in protein crystallography, in particular for experimental phasing, and have discussed the possible applications of such predictive classifiers. This concept will be generalized in the future to cover a broader range of structure-determination methods including isomorphous replacement-related methods and molecular replacement. This will require a substantial expansion of METRIX_DB. Although intervention by an expert crystallographer is still essential for corner cases, such machine-learning support systems will become more and more important. The data rates and data volumes accumulated during diffraction experiments are already such that it is difficult for a human to keep pace. Furthermore, the number of scientists who are using protein crystallography as an analytical tool rather than a scientific discipline is rapidly increasing, placing a greater burden on automated acquisition and analysis systems at user facilities. For these reasons, it is expected that decision-making tools based on machine learning will form an integral part of macromolecular crystallography beamline facilities in the future.

Related literature

The following references are cited in the supporting information for this article: Arndt et al. (1968), Bijvoet et al. (1951), Diederichs & Karplus (1997), Howell & Smith (1992), Schneider & Sheldrick (2002), Srinivasan & Parthasarathy (1976), Weiss (2001) and Wilson (1942, 1949, 1950). Supplementary tables and figure. DOI: 10.1107/S2052252520000895/jt5042sup1.pdf
