Jize Zhang1, Bhavya Kailkhura1, T Yong-Jin Han2. 1. Center for Applied Scientific Computing, Computing Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550, United States. 2. Materials Science Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550, United States.
Abstract
In this paper, we leverage predictive uncertainty of deep neural networks to answer challenging questions material scientists usually encounter in machine learning-based material application workflows. First, we show that by leveraging predictive uncertainty, a user can determine the required training data set size to achieve a certain classification accuracy. Next, we propose uncertainty-guided decision referral to detect and refrain from making decisions on confusing samples. Finally, we show that predictive uncertainty can also be used to detect out-of-distribution test samples. We find that this scheme is accurate enough to detect a wide range of real-world shifts in data, e.g., changes in the image acquisition conditions or changes in the synthesis conditions. Using microstructure information from scanning electron microscope (SEM) images as an example use case, we show that leveraging uncertainty-aware deep learning can significantly improve the performance and dependability of classification models.
Deep
learning (DL) techniques are achieving remarkable success
in a wide range of scientific applications.[1] One such application is Material Informatics that applies DL methods
to accelerate the discovery, synthesis, optimization, and deployment
of materials.[2] As a motivating example,
consider the problem of Material Discovery where one is interested
in screening novel materials that meet certain performance requirements.[3] A critical obstacle in developing and deploying
materials in a timely manner is understanding the complex process,
structure, property, and performance (PSPP) relationships, including
evaluating the property of interest of a material, which can take
a significant amount of time and effort. DL offers opportunities to
potentially accelerate this process by learning the complex structure–property
relationships (e.g., materials’ physical, mechanical, optoelectronic,
and thermal properties) from historical characterization and property
data (i.e., training data) and produce models that can predict material
properties at dramatically lower overall costs on unseen test data.
Thus, DL is becoming increasingly prevalent in Material Informatics
workflows for making potentially important decisions, thereby making
applications of DL in Material Discovery high stakes in nature. For example,
incorrect decisions or predictions in Material Discovery can have (a) significant
costs in terms of experimental resources and time when testing poorly performing
materials recommended by DL or (b) lost opportunities to discover high-performing
materials rejected by DL.

The majority of efforts in Material Informatics
are currently devoted
to training deep neural networks (DNNs) that can achieve high accuracy
on a holdout data set from the training distribution.[4,5] Existing
efforts implicitly assume ideal conditions during both training and
testing, namely (a) access to a sufficiently large labeled
training data set and (b) test data from the “same distribution”
as the training set. Unfortunately, these conditions are seldom
met in Material Discovery applications since a large amount of relevant
historical data is rarely available and the test data is typically
and systematically different from the training data, either through
noise[6] or other changes in the distribution.[7] This is an inherent challenge in applying DL
to scientific domains whose aim is to find something “new”
and “different” that outperforms existing materials.
It is well known that DL models are highly susceptible to such distributional
shifts, which often lead to unintended and potentially harmful behavior
(i.e., overconfident predictions), especially when the models are trained with
an insufficient amount of data.[8] Therefore,
to ensure that the trained models behave reliably and to establish
DL as a dependable solution in high-stake Material Discovery applications,
the following equally important questions need to be answered.
How Much Data
Is Required to Train a DNN?
The quality
and amount of training data are often the most important factors that
determine the accuracy of a DL model. However, determining the amount
of data necessary to train a DNN to achieve high classification accuracy
is a challenging problem. A reasonable approach is to model validation
accuracy as a function of the amount of training data to predict the
optimal data size needed to achieve a certain classification performance
(referred to as the learning curve-based approach).[9] Unfortunately, to get a reliable estimate of validation
accuracy, this approach requires a large amount of labeled data not
used in the training process. This is clearly not feasible for the
application of interest where the labeling process (e.g., performing experiments
to collect relevant data) is expensive and time-consuming.

To overcome this challenge, we propose to leverage the DNN prediction
uncertainty (or confidence) information and approximate the validation
accuracy in the learning curve using unlabeled data only. Specifically,
we obtain the average confidence (i.e., the predicted probability
of the most plausible label) over the set of unlabeled data as the
estimated validation accuracy for generating the learning curve. We
show that having access to unlabeled scanning electron microscopy
(SEM) images is sufficient in our approach to determine with high
precision how many more material samples should be imaged and collected
to provide sufficient DL training data to achieve the desired accuracy
level.
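For concreteness, this confidence-based surrogate requires only the model's predictive probabilities on unlabeled images; the following minimal NumPy sketch (our illustration with hypothetical names, not the authors' released code) shows the idea:

```python
import numpy as np

def confidence_based_accuracy_estimate(prob_matrix):
    """Estimate validation accuracy without labels as the average confidence,
    i.e., the mean predicted probability of each sample's most plausible label.
    prob_matrix: array of shape (n_unlabeled_samples, n_classes)."""
    confidences = prob_matrix.max(axis=1)  # max softmax probability per sample
    return float(confidences.mean())
```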
How to Equip DNNs with a Reject Option?
The performance
and dependability of DNN classification models can be dramatically
improved by building in a reject option. Testing the DL model on difficult
(or ambiguous) material samples is a common case in material science
applications since the discovery of new materials often involves moving
away from existing materials using different synthesis and processing
conditions. In such a context, having a reject option allows DNN models
to refer a subset of difficult material samples for further evaluation
and testing while making predictions on the rest. This is also known
as selective classification where the classifier abstains from making
a decision when the model is not confident while keeping coverage
as high as possible. An added benefit of deploying selective classification
is that it has the potential to yield substantial improvements
in performance on the remaining material data. Unfortunately,
many off-the-shelf ML models are not well calibrated, i.e., the probability
associated with the predicted class label does not match the probability
of that prediction being correct.[10,11] It has been
observed that this issue is more noticeable in complex models such
as DNNs.[12] Intuitively, this observation
implies that DNNs are particularly bad at recognizing ambiguous samples.

To overcome this challenge, we propose to use state-of-the-art
uncertainty-aware DNNs[13,14] in Material Discovery workflows
as they are known to be better calibrated as compared to their baseline
counterparts. Specifically, we show that uncertainty-guided decision
referral can substantially improve the classification accuracy on
the nonreferred samples while reducing the number of referred (i.e.,
rejected) samples.
How to Make DNNs Recognize Out-of-Distribution
(OOD) Examples?
Test data in real-world DL applications usually
deviates from the
training data distribution, e.g., due to sample selection bias, nonstationarity,
and noise corruptions in some cases.[8] For
example, in the material science application (i.e., analysis of SEM
images), the deviation between train and test data may arise due to
(a) changes in the equipment used to capture the training material
samples or (b) changes in the material synthesis process. Having plentiful
training data that can potentially cover all possibilities and variance
is ideal, but in practice, that is rarely attainable. Thus, a desirable
feature in a Material Discovery workflow is for a model to be aware of,
and not be very confident on, test data that is very far (or different)
from the data used to train it. For example, a potentially novel material
that is different from all training data should result in a request
for expert help rather than misclassification into a known material
class as the detection of undiscovered material is in fact a task
of significant interest. In other words, we require DNNs to
be aware of cases when the data acquisition setup or synthesis conditions
are so different from those used during training that DL predictions
cannot be trusted. Unfortunately, DNNs often make overconfident misclassifications
in the presence of distributional shifts and out-of-distribution (OOD)
data.[15] Accurate predictive uncertainty is a highly desirable property
in such cases as it can help practitioners assess the true performance
and risks and decide whether the model predictions should (or should not)
be trusted.[16]

Similar to the second question, we propose to leverage the predictive
uncertainties produced by state-of-the-art uncertainty-aware DNNs to
overcome this challenge in Material Discovery workflows. Specifically,
we show that the predictive uncertainty of uncertainty-aware DNNs is
indicative enough to differentiate among in-distribution data, data
captured with different equipment, and data generated with different
synthesis conditions.

To put these questions into context, we consider
the problem of
differentiating materials based on their microstructures as observed
by their SEM images. This problem is formulated as a K-class classification
problem where each class corresponds to a specific material, with unique
characteristics determined by a unique set of synthesis parameters.

We use deep neural networks (DNNs) to determine whether DL models
can learn to differentiate materials purely based on their complex
microstructures, which are often challenging for human assessment when
samples have similar-looking microstructures but very different mechanical
test behaviors (Figure 1). By determining the accuracy of the DL models
along with prediction uncertainties, we may begin to understand which
microstructures lead to samples with desired performance metrics and
avoid extensive testing and evaluation for newly synthesized materials
with high-confidence predictions. By demonstrating that DL can featurize
microstructural information from SEM images and classify each material
with high accuracy, we can start to build confidence in DL models to
screen, optimize, and discover materials in a significantly shorter
amount of time compared to an Edisonian approach.
Figure 1
Schematic of a workflow for SEM image-based material classification task.
Main Contributions
Deep learning-based solutions for Material Informatics applications
have often been suggested without considering the important questions
listed in the previous section. Yet, techniques to ensure the
dependability of automated decisions are crucial for integrating
DL in Material Discovery workflows. The main contribution of this
paper is to show that uncertainty-aware DL is a unified solution
capable of answering all of these questions by leveraging the predictive
uncertainty of DNNs. We demonstrate the applicability of our technique
by using DL models to classify the microstructural differences of
a material based on the corresponding SEM images. The summary of
our findings is as follows.

First, we show that by leveraging predictive uncertainty, one can
estimate the classification accuracy at a given training sample size
without relying on labeled data. This serves as a general methodology
for determining the training data set size necessary to achieve a
certain target classification accuracy.

Next, we show that predictive uncertainty can be a guiding principle
to decide which material samples should be referred to a material
scientist for further testing and evaluation instead of making a
DL-based prediction. We find that this uncertainty-guided decision
referral can dramatically improve the classification accuracy on the
remaining (i.e., nonreferred) examples.

Finally, we show that predictive uncertainty can be used to detect
distributional changes in the test data. We find that this scheme is
accurate enough to detect a wide range of real-world shifts, e.g., due
to changes in the imaging instrument or changes in the synthesis
conditions.

Although we focus on a specific material application in this paper,
the proposed methodology is quite generic and can be used to make
the application of DL to a wide range of scientific domains dependable
and trustworthy.
Methods
Data Sets
Our
main data set is composed of SEM images
of 30 different lots of 2,4,6-triamino-1,3,5-trinitrobenzene (TATB).
Here, a lot refers to TATB crystals produced under a specified synthesis/processing
condition. TATB is an insensitive high-explosive compound of interest
for both the Department of Energy and Department of Defense.[17] After each lot has been synthesized (with different
synthesis conditions), each lot is analyzed with a Zeiss Sigma HD
VP SEM to produce high-resolution scanned images while holding image
acquisition conditions (e.g., brightness and contrast settings) fixed
across all lots/images. Each image tile consists of 1000 × 1000
pixels, with a corresponding field of view of 256.19 μm ×
256.19 μm. The combined images captured for the 30 lots resulted
in 59 690 grayscale SEM images; thus, the labeled data of interest
for the DNNs consists of 59 690 grayscale SEM images, each labeled
with a unique designator per class (30 classes in total). Example SEM
images of TATB for some classes are provided in Figure 2. One can notice
strong visual discrepancies across SEM images from different lots (or
classes).
Figure 2
Representative SEM images to illustrate the typical microstructural
variability for different TATB lots. The varying particle size, porosity,
polydispersity, and facetness can be clearly observed. Images have
been processed to normalize image contrast and brightness levels.
Reprinted with permission from ref (4) (CC BY 4.0).
Deep Learning Models
Let $x \in \mathcal{X}$ denote the SEM images and $y \in \{1, \ldots, K\}$ the K classes
of materials. We use $\{(x_i, y_i)\}_{i=1}^{N}$ to represent the N training
data points (pairs of images and labels). Before being fed into the
model, the SEM images are downsampled to a resolution of 64 × 64 and
the grayscale value of each pixel is normalized into the range [0, 1].
Given training and validation data sets, our goal is then to learn a
classifier that predicts the class of material samples in an unseen
test data set.

We trained the following vanilla and uncertainty-aware models:

Vanilla Softmax: our uncertainty-unaware baseline simply
regards the softmax outputs provided by the DNN as the predictive
probabilities. Unfortunately, high confidence softmax predictions
can be woefully incorrect and may fail to indicate when they are likely
mistaken[11] or when the input is OOD.[8]

Dropout: we use
Dropout,[18] a variational-inference-based
Bayesian uncertainty quantification
approach. During the training process, the Dropout DNN is trained to minimize
the Kullback–Leibler divergence between the approximating distribution (i.e., a
product of Bernoulli distributions across the DNN parameters) and the Bayesian
posterior over the DNN parameters. At inference time, Dropout predicts the
outputs by Monte Carlo sampling the network with randomly dropped-out units and
averaging the resulting predictions, which approximates integrating the predictive
likelihood over the posterior distribution. The simple and computationally
lightweight nature of Dropout, however, may yield approximate posteriors that are
inaccurate in some scenarios.[19−21]

Deep Ensemble: finally,
we include Deep Ensembles,[22] a practical,
scalable, and non-Bayesian uncertainty
quantification alternative for DNNs. As the name suggests, the core
idea is to train the DNN classifier in an identical manner (i.e.,
same model architecture, training data, and training procedure) multiple
times but each time with a random initialization of model parameters.
With T DNN classifiers (parameterized by θ_t, t = 1, ..., T) included in the
ensemble, the prediction probability vector for Deep Ensembles is the average
of the softmax vectors of the individual DNNs.

We use a wide residual network (WRN)[23] architecture due to its strong performance on
benchmark computer
vision data sets.[24] We apply a depth of
16 and a widening factor of 2, and the summary network structure is given
in Figure 3. Given
the WRN architecture, we train these DNN models with the cross entropy
loss function and the Adam optimizer.[25] The DNN model is trained from scratch (therefore, no pretraining).
We set the learning rate of 0.001 and decay the learning rate by half
every 50 epochs. We use a minibatch size of 64 and a weight decay
factor of 5e–4. Hyperparameters (including learning
rate, weight decay factor, as well as the network depth and width)
were determined through the HpBandster toolbox, an efficient tool
for hyperparameter optimization.[26] We used
200 epochs with an early stopping mechanism to terminate training when
the validation performance did not improve after 100 epochs. For Dropout,
we used a dropout rate of p = 0.3 and T = 16 dropout samples for inference. For Deep Ensembles, we train T = 16 models. All other hyperparameters are kept the same
as in the standard baseline case. No image preprocessing technique
was adopted in the training process. Horizontal flips were used for
training data augmentation. We randomly divide the labeled SEM image
data set into 80, 10, and 10% splits for training, validation, and
testing, respectively. The codes can be found online.
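As a concrete illustration of how the predictive probabilities of the two uncertainty-aware models can be obtained at inference time, the following PyTorch-style sketch (ours, not the released code; `model`, `models`, and `x` stand for a trained WRN, a list of T independently trained WRNs, and a batch of preprocessed image tensors, respectively) averages T stochastic or ensemble softmax outputs:

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, T=16):
    """Monte Carlo Dropout: average softmax outputs over T stochastic forward
    passes with dropout active at inference time. (In practice one typically
    keeps batch-normalization layers in eval mode and activates only dropout.)"""
    model.train()  # keep dropout units randomly dropped at inference
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])
    return probs.mean(dim=0)  # (batch_size, K) predictive probabilities

def deep_ensemble_predict(models, x):
    """Deep Ensembles: average the softmax vectors of independently trained models."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=1) for m in models])
    return probs.mean(dim=0)
```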
Figure 3
Wide ResNet architecture with a depth of 16 and a width of 2. The
notation (k × k, n) in the convolutional block and residual blocks denotes a
filter of size k and n channels. The dimensionality of outputs from each block
is also annotated. The detailed structure of the residual block is shown in the
dashed line box. Note that batch normalization and ReLU precede the convolution
layers and fully connected layer but are omitted in the figure for clarity.
Performance Evaluation Metrics
We evaluate the performance
of DNN models in terms of their (a) predictive performance and (b)
predictive uncertainty quality on the testing data set.

The predictive performance is measured using the Classification Accuracy
metric (i.e., the percentage of correct predictions among all data
points). On the other hand, following ref (27), we use the Shannon entropy[28]
as the metric to quantify the uncertainty contained in the prediction
probability vector $p = (p_1, \ldots, p_K)$:

$H(p) = -\sum_{k=1}^{K} p_k \log p_k$

Basically, it captures the average amount of information contained in
the prediction: $H(p)$ attains its maximum value when the classifier
prediction is purely uninformative (assigning all classes equal probability
1/K) and attains its minimum value when the classifier is absolutely
certain about its prediction (assigning zero probability to all but one
class). The quality of the predictive uncertainties is quantified using
the following metrics:

Negative log-likelihood (NLL): a standard metric of uncertainty quality
obtained by calculating the log of the joint probability of the predictions
on all $n$ test samples[29]

$\mathrm{NLL} = -\sum_{i=1}^{n} \log p(y_i \mid x_i)$

Lower NLL indicates better uncertainty quality.

Expected calibration error (ECE): calibration accounts for the degree
of consistency between the predictive probabilities and the empirical
accuracy. We adopt a popular calibration metric, ECE,[30] which measures
the average absolute discrepancy between the prediction confidence and
the accuracy

$\mathrm{ECE} = \sum_{j=1}^{N_b} \frac{|B_j|}{n} \left| \mathrm{acc}(B_j) - \mathrm{conf}(B_j) \right|$

where we sort the test data points according to their confidence (the
prediction probability for the most likely label, i.e., $\max_y p(y \mid x)$)
and bin them into $N_b$ quantiles. Here, $\mathrm{acc}(B_j)$ and $\mathrm{conf}(B_j)$
are the average accuracy and confidence of points in the jth bin,
respectively, and $|B_j|$ is the number of data points in that bin. We
use 20 equal-spaced bins to measure ECE in this paper. Lower ECE is
more favorable.
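These three quantities are straightforward to compute from the predictive probability matrix and the true labels; a minimal NumPy sketch (our illustration; the equal-width binning variant of ECE is shown) follows:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of the (n, K) probability matrix."""
    return -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)

def negative_log_likelihood(probs, labels):
    """NLL: negative log of the joint probability assigned to the true labels."""
    true_probs = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log(np.clip(true_probs, 1e-12, 1.0)))

def expected_calibration_error(probs, labels, n_bins=20):
    """ECE: bin-size-weighted average gap between accuracy and confidence."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += (in_bin.sum() / n) * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```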
Results and Discussion
In this section,
we first provide details on the DL models and
their performance on the in-distribution test data. Then, we discuss
and evaluate the performance of proposed solutions on the three aforementioned
use cases.
Performance Evaluation Result
On the test data set, Table 1 shows that Deep Ensembles
outperforms the rest of the methods in accuracy and ECE. Dropout achieves
the best NLL and also a much better ECE than the Softmax baseline.
Intriguingly, the two uncertainty quality metrics rank Deep Ensembles
differently (best ECE and worst NLL). This might be attributed to
the well-known pitfall of NLL, which overpenalizes samples
with very low prediction probabilities for their true classes.[14]
Thus, we recommend ECE as the default uncertainty quality metric to
avoid such a pitfall and also for its better interpretability
(accuracy vs confidence). Overall, these results demonstrate the effectiveness
of uncertainty-aware DNN approaches over the Softmax baseline. It
is important to remember that their performance gain over the baseline
is not free. In the case of Deep Ensembles, additional cost must be
spent on training and inference. For example, we trained an ensemble
of 16 classifiers independently, and the computational cost is 16
times higher than the baseline. For Dropout, although the training cost
is comparable to the baseline, the inference computation is similarly
expensive to that of Deep Ensembles.
Table 1
Accuracy and Uncertainty Quality of Different Methods(a)

Approaches | Accuracy (↑) | NLL (↓) | ECE (↓)
Baseline | 92.3% | 0.919 | 5.38%
Dropout | 92.1% | 0.903 | 2.79%
Deep Ensembles | 95.3% | 0.920 | 1.52%

(a) Bold values: P < 0.05.

The success of Deep Ensembles is anticipated. After all, ensembling
Ensembles is anticipated. After all, ensembling
multiple machine learning models is a well-known treatment to reduce
the error of a single model, which has been theoretically[31,32] and empirically[33,34] explained. Comparing to Dropout,
the other ensemble-like technique, we hypothesize that Deep Ensembles
learns multiple models whose predictions are much more diverse (lowly
correlated) given the high dimensionality and nonconvexity of DNN
parameter spaces, which is crucial for enhancing the classification
error and the uncertainty quantification quality.[35]
DL Trustworthiness Case Studies
In this section, we
show how uncertainty scores can be leveraged to answer the aforementioned
important problems to make the Material Discovery workflows dependable.
Case
Study 1: How Much Data Is Required to Train a DNN?
The need
for large amounts of labeled training data is often the
bottleneck to the successful deployment of machine learning models.
This is especially crucial for DNNs due to their overparameterized
nature.[36,37] Yet, in scientific applications, obtaining
high-fidelity labels can be expensive due to the associated costly
and time-consuming experiments.[38] Therefore,
an important issue is to decide how much training data is required
to achieve the desired accuracy level, which allows the user to prioritize
experimental plans accordingly.

Conventionally, this task is
done by generating the learning curve,[9] which approximately represents the relationship between the training
data amount and the validation accuracy on a set of labeled data unused
in training.[36] One can even further predict
the needed training data size to achieve the required accuracy by
extrapolating the learning curves.[39,40] However, a
drawback of such conventional validation-based learning curve approach
is its reliance on a large amount of labeled data unused in training
to accurately evaluate the validation accuracy, which can be expensive
or even infeasible in many applications. Here, we ask the question
of whether we can leverage the predictive uncertainty information
to solve this use case with access to unlabeled validation data (i.e.,
SEM images without material class labels) only. Specifically, we decide
to test if the average prediction confidence (which can be computed
without any labeled information) on the unlabeled data set can be
used as a surrogate to assess the validation accuracy and approximately
generate the learning curve. The logic behind such an approach is
that, as discussed in the previous subsection, for DNN models
with well-calibrated uncertainties (low ECE) the average confidence
should closely match the accuracy.

To examine the feasibility,
we train DNN classifiers with varying
amounts of training data (ranging from 10%, 20%, up to 100% of the maximum
available training data set size). We monitor the average validation
accuracy as well as the predicted accuracy based on confidence and
plot the corresponding learning curves in Figure 4. We see a significant difference between
curves for Softmax. Specifically, it overestimates the validation
accuracy and will result in underestimating the needed training data
amount. For Dropout, the two curves consistently stay close but seem
to be weakly correlated since the average confidence can sometimes
decrease while the validation accuracy keeps improving. This weak
correlation can be harmful in certain scenarios. For example, the
users might want to determine if the DNN performance continues to
improve as training data grows, and the Dropout-predicted learning
curves may lead them to wrong decisions. For Deep Ensembles, the predicted
learning curve not only closely matches the actual one but also shows
nearly identical trends.
Figure 4
Uncertainty-guided learning curves. The predicted
(confidence-based)
and actual learning curves using different approaches.
To summarize, these results show that uncertainty scores
from both
Dropout and Deep Ensembles can be leveraged to predict the required
amount of training data to achieve a certain validation accuracy without
having access to labeled validation data.
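The overall procedure can be summarized by the sketch below (our illustration only; `train_model` and `predict_proba` are hypothetical helpers standing in for the training and inference code described in the Methods section, and the data arrays are assumed to be available):

```python
import numpy as np

fractions = [0.1 * k for k in range(1, 11)]   # 10%, 20%, ..., 100% of training data
actual_curve, predicted_curve = [], []

for frac in fractions:
    n = int(frac * len(train_images))
    model = train_model(train_images[:n], train_labels[:n])   # hypothetical helper
    probs = predict_proba(model, val_images)                  # (n_val, K) probabilities
    actual_curve.append(float((probs.argmax(1) == val_labels).mean()))  # needs labels
    predicted_curve.append(float(probs.max(1).mean()))        # label-free confidence surrogate
```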
Case Study 2: How to Equip
DNNs with a Reject Option?
Next, we tested another practical use of predictive uncertainties:
identifying confusing samples so that the DNN can refrain from making
a prediction. Rather than making an erroneous decision, the classifier
declines to be trusted on such instances and refers these difficult
material samples for further testing and evaluation. This idea, formally
referred to as selective classification,[41,42] has been introduced
recently in the context of DNN classifiers.[43]

In this case study, we design the reject mechanism based on
the predictive entropy, where the user is allowed to reject the prediction
if the entropy of a DNN prediction exceeds a certain threshold. The
quality of uncertainty can then be reflected by the effectiveness
of the reject mechanism. To measure the performance, we adopted the
risk-coverage trade-off curve[42,43] of selective classification.
Here, the coverage refers to the ratio of data points for
which the classifier is confident enough (i.e., the predictive uncertainty
is lower than a given threshold), while the risk denotes the classification
error among such sufficiently confident points. The ideal goal is
to minimize the risk while maximizing the coverage.

We compared different approaches for selective classification based
on the risk-coverage trade-off curves in Figure 5. We see that the
trade-off curves based on the baseline Softmax and Dropout uncertainties
are nearly identical, while Deep Ensembles performed much better on this
task. For example, at the same coverage level of 90% (i.e., the classifier
rejects the 10% of instances that it is most uncertain about), Deep
Ensembles has around 1.5% classification error, much lower than the other
two approaches (3.5%). This further verifies the superior uncertainty
quality of Deep Ensembles and presents another practical benefit of
well-calibrated prediction uncertainties.
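For reference, a risk-coverage curve of this kind can be traced by sorting the test points from most to least confident and accumulating the error rate; a minimal sketch (ours, assuming per-sample entropies and correctness indicators are available) follows:

```python
import numpy as np

def risk_coverage_curve(entropies, correct):
    """Risk-coverage trade-off from predictive entropies.
    entropies: (n,) predictive entropy per test point (lower = more confident)
    correct:   (n,) boolean array, True where the prediction was correct."""
    order = np.argsort(entropies)                    # most confident points first
    errors = (~np.asarray(correct)[order]).astype(float)
    n = len(entropies)
    coverage = np.arange(1, n + 1) / n               # keep the k most confident points
    risk = np.cumsum(errors) / np.arange(1, n + 1)   # error rate among kept points
    return coverage, risk
```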
Figure 5
Uncertainty-guided
decision referral. Risk-coverage curves for
different DNN approaches.
To summarize, our results show that Deep Ensemble uncertainty-guided
decision referral can dramatically improve the classification accuracy
on the nonreferred material samples while maintaining a minimal fraction
of referred (rejected) material samples.
Case Study 3: How to Make
DNNs Recognize Out-of-Distribution
Examples?
In the real-world setting, DNNs often encounter
data collected under conditions different from those used in the
DNN training process. This can occur because of (a) changes in the
image acquisition conditions, (b) changes in the synthesis conditions
(e.g., discovery of a new material), or (c) unrelated data samples.
In such cases, it is crucial to have a detection mechanism to flag
such out-of-distribution (OOD) data points that are far away from
the training data’s distribution.

In this section, we
test the potential use of predictive uncertainties for detecting OOD
data points, with the underlying logic being that DNN models with
well-calibrated uncertainties should assign higher predictive uncertainties
to the OOD instances. We formulate the OOD detection problem as a
binary classification problem based on the predictive entropy and
quantify the performance of the corresponding OOD classifiers. The
OOD data are regarded as the positive class and the in-distribution
data as the negative class, and the OOD classifiers make decisions
solely based on the values of prediction entropy. We adopt the evaluation
metric in ref (44) and
measure the classification performance using the receiver operating
characteristic curve (ROC) and the area under the curve (AUC). The
ROC curve plots the false-positive rate (the probability of in-distribution
data being classified as OOD) versus the true positive rate (the probability
of OOD data being classified as OOD) for different thresholds on the
entropy. Therefore, the closer the ROC curve is to the upper left
corner (0.0, 1.0), the better the OOD classifier is,[45] while a totally uninformative classifier would exhibit
a diagonal ROC curve. The AUC provides a quantitative measure on the
ROC curves by measuring the areas under them. A higher (closer to
1) AUC value is better, while the uninformative classifier has an
AUC of 0.5.
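Concretely, the entropy-based OOD classifier and its ROC/AUC can be evaluated as in the following sketch (our illustration, using scikit-learn's roc_curve and roc_auc_score):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_ood_detector(entropy_in, entropy_ood):
    """Treat OOD as the positive class and use predictive entropy as the score:
    higher entropy should indicate a more likely OOD sample."""
    scores = np.concatenate([entropy_in, entropy_ood])
    labels = np.concatenate([np.zeros(len(entropy_in)), np.ones(len(entropy_ood))])
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, roc_auc_score(labels, scores)
```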
Detecting Changes in Image Acquisition Conditions
To
facilitate a more concrete understanding on the necessity of such
OOD detection mechanism, let us first examine how our DNN classifiers
perform on some real-world OOD SEM images. This particular OOD phenomenon
is caused by replacing the SEM filament. As a result, while the brightness
and contrast settings of SEM images were held constant before and
after the filament change, the newly collected images are different
from the data we used in training and testing (see Figure 6) for the same lots. This is
particularly applicable in an automated image collection workflow,
where image collection conditions are usually fixed without human
intervention. Such a type of OOD is commonly denoted as covariate
shift in the machine learning community.[46] We recorded the DNN
classification accuracy on the in-distribution (original filament) and
the OOD (replaced filament) SEM images from the same material lot in
Table 2. The gaps between in-distribution and OOD accuracy are
substantial, ranging from 13 to 19%. This highlights the risk of being
misled by the DNN’s erroneous predictions when
encountering such real-world OOD data.
Figure 6
Effect of changing filaments
on the SEM images while maintaining
fixed brightness and contrast acquisition settings from the same lot.
Table 2
In-Distribution and OOD Classification Accuracy for the AT Lot

Test accuracy | Softmax | Dropout | Deep Ensembles
In-distribution | 83.5% | 92.1% | 89.5%
OOD | 70.4% | 73.1% | 70.9%
Drop due to OOD | 13.1% | 19.0% | 18.6%
Now, we focus our attention on the problem of detecting data
generated using different image acquisition conditions than the training
data. In this experiment, we use 1000 SEM images from the existing
SEM image data set as the in-distribution data and obtain 1000 covariate
shift (replaced filament) images for the same material as the OOD
data. The ROC curves and AUC are shown in Figure 7. The classifiers based on both Softmax and
Dropout uncertainties performed poorly (flat ROC curves and low AUC
value), indicating that the OOD detection based on such uncertainties
will not work. This is due to the fact that the difference between
in- and out-of-distribution images is subtle in this experiment, making
the detection task very challenging. On the other hand, the OOD classifier
based on Deep Ensembles performed much better (visually from the ROC
curve or quantitatively from AUC). From a practical point of view,
it might be the only OOD detector capable of identifying a large number
of replaced filament images without triggering a high volume of false-positive
alarms. The superiority of Deep Ensembles is not completely surprising—it
aligns with some prior research[14] that
also identified Deep Ensembles as the best performer on covariate
shift data.
Figure 7
ROC curves and AUC for detecting covariate-shifted SEM material
images based on the predictive uncertainties from different approaches.
Detecting Changes in Synthesis Conditions
In this experiment,
we push the OOD data set further away from the training data distribution.
Specifically, the OOD data points are SEM images for some unseen classes
of the TATB crystal material, i.e., they do not belong to the 30 classes
in the training data set due to different manufacturing techniques
or postprocessing (i.e., grinding), which will produce very different
looking TATB crystals. OOD data from unseen classes will frequently
be encountered in realistic applications such as material discovery
due to synthesis condition changes. As seen from Figure 8, all examined approaches
achieve acceptable performance (AUC higher than 0.7), meaning that
they should be applicable to distinguish the SEM images from novel
material classes.
Figure 8
ROC curves and AUC for detecting SEM material images of
unseen
classes based on the predictive uncertainties from different approaches.
Detecting Unrelated Data
Finally,
we examine an extreme
case for OOD detection, where the OOD samples are truly far away from
(or unrelated to) the distribution of material SEM images. For this
case, we obtained OOD images from the CIFAR-10 natural image data
set (including 10 categories of images, such as cats, dogs, and birds).[47]
As the CIFAR-10 images are RGB color images at a lower (32 by 32)
resolution, grayscale conversion and upsampling were applied to convert
them into the format of the SEM images (64 by 64 grayscale). The
detection results are shown in Figure 9. Deep Ensembles
achieved a near-perfect detection result (AUC close to 1). Interestingly,
the simple Softmax baseline also performed better than Dropout, as
the latter only achieved a 0.42 AUC (worse than randomly guessing).
Although CIFAR-10 images are visually distinguishable from material
SEM images, the results show that it is still nontrivial to obtain
a good uncertainty-based OOD detector. The near-perfect performance
of Deep Ensembles is very impressive and validates the superiority
of its uncertainty estimates over Softmax and Dropout.
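The conversion of CIFAR-10 images into the SEM input format can be done, for instance, as in the sketch below (our illustration; the paper does not specify the exact interpolation used, so bilinear resampling is an assumption):

```python
import numpy as np
from PIL import Image

def cifar_to_sem_format(rgb_image):
    """Convert a 32x32 RGB CIFAR-10 image (uint8 array of shape (32, 32, 3))
    into the 64x64 grayscale, [0, 1]-normalized format used for the SEM data."""
    gray = Image.fromarray(rgb_image).convert("L")           # RGB -> grayscale
    gray = gray.resize((64, 64), resample=Image.BILINEAR)    # upsample 32x32 -> 64x64
    return np.asarray(gray, dtype=np.float32) / 255.0
```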
Figure 9
ROC curves and AUC for
detecting unrelated CIFAR-10 images based
on the predictive uncertainties from different approaches.
Can We Leverage Uncertainties to Identify Different Types of
Shifts?
In this section, we ask whether uncertainty-guided
OOD detection approaches can differentiate among different sources
of distribution shifts. This is an important feature to have as it
might inform users on what they should do with the OOD data, for example,
whether the user could utilize the OOD data to augment the existing training
data (the case of changing image acquisition conditions), conduct
further testing (the case of changing synthesis conditions), or simply
discard the data (the case of unrelated data).

To answer this question, we characterize the distribution of predictive
entropy for in-distribution and OOD data using histograms in Figures 10–12.
Intuitively, we expect the predictive
entropy of OOD data to be always higher (i.e., more uncertain) than
in-distribution ones. Furthermore, this discrepancy should become
more noticeable as the OOD data shifts away from the training data
distribution. However, from Figures 10 and 11, we observe that both
Softmax and Dropout are very confident (assigning low entropy values)
on their predictions for domain shift and CIFAR OOD data, although
they both performed reasonably well for Unseen-class data. On the
other hand, as seen in Figure 12, Deep Ensembles always produces higher predictive
entropy for all examined OOD data sets, and the gap between in-distribution
and OOD samples’ predictive entropy indeed becomes more apparent
with an increase in the amount of shift. In other words, Deep Ensembles
can differentiate among different sources of shifts.
Figure 10
Softmax: histogram comparisons
of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
Figure 11
Dropout: histogram comparisons of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
Figure 12
Deep Ensembles: histogram comparisons of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
To summarize, our results show
that uncertainties from Deep Ensembles
can be used to detect out-of-distribution samples. Further, their
uncertainties are able to differentiate the sources of distribution
shifts and hint toward what to do with the OOD data, e.g., using the
OOD data with changed image acquisition conditions in data augmentation,
conducting new mechanical testing after detecting the OOD data from
unseen classes of materials, or simply discarding the unrelated OOD
data.
Conclusions
In this work, we successfully
demonstrated the benefits, applicability,
and limitations of uncertainty-aware deep learning methods for making
material discovery workflows more dependable. Specifically, we showed
how uncertainty-guided methods can serve as a unified approach to
address several important issues in the examined material classification
problem. There are still some issues yet to be resolved for a successful
application of machine learning in Material Discovery workflows, but
leveraging uncertainties in DL models is a first step toward the
dependable implementation of DL models for material applications.