Jize Zhang1, Bhavya Kailkhura1, T Yong-Jin Han2. 1. Center for Applied Scientific Computing, Computing Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550, United States. 2. Materials Science Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California 94550, United States.
Abstract
In this paper, we leverage predictive uncertainty of deep neural networks to answer challenging questions material scientists usually encounter in machine learning-based material application workflows. First, we show that by leveraging predictive uncertainty, a user can determine the required training data set size to achieve a certain classification accuracy. Next, we propose uncertainty-guided decision referral to detect and refrain from making decisions on confusing samples. Finally, we show that predictive uncertainty can also be used to detect out-of-distribution test samples. We find that this scheme is accurate enough to detect a wide range of real-world shifts in data, e.g., changes in the image acquisition conditions or changes in the synthesis conditions. Using microstructure information from scanning electron microscope (SEM) images as an example use case, we show that leveraging uncertainty-aware deep learning can significantly improve the performance and dependability of classification models.
Deep
learning (DL) techniques are achieving remarkable success
in a wide range of scientific applications.[1] One such application is Material Informatics that applies DL methods
to accelerate the discovery, synthesis, optimization, and deployment
of materials.[2] As a motivating example,
consider the problem of Material Discovery where one is interested
in screening novel materials that meet certain performance requirements.[3] A critical obstacle in developing and deploying
materials in a timely manner is understanding the complex process,
structure, property, and performance (PSPP) relationships, including
evaluating the property of interest of a material, which can take
a significant amount of time and effort. DL offers opportunities to
potentially accelerate this process by learning the complex structure–property
relationships (e.g., materials’ physical, mechanical, optoelectronic,
and thermal properties) from historical characterization and property
data (i.e., training data) and produce models that can predict material
properties at dramatically lower overall costs on unseen test data.
Thus, DL is becoming increasingly prevalent in Material Informatics
workflows for making potentially important decisions, thereby making
applications of DL in Material Discovery high stakes in nature. For example,
incorrect decisions or predictions in Material Discovery can have (a) significant
costs in terms of experimental resources and time when testing poorly performing
materials recommended by DL or (b) lost opportunities to discover high-performing
materials rejected by DL.

The majority of efforts in Material Informatics
are currently devoted
to training deep neural networks (DNNs) that can achieve high accuracy
on a holdout data set from the training distribution.[4,5] Existing
efforts implicitly assume ideal conditions during both training and
testing, namely (a) access to a sufficiently large labeled
training data set and (b) test data from the “same distribution”
as the training set. Unfortunately, these conditions are seldom
met in Material Discovery applications since a large amount of relevant
historical data is rarely available and the test data is typically
and systematically different from the training data, either through
noise[6] or other changes in the distribution.[7] This is an inherent challenge in applying DL
to scientific domains whose aim is to find something “new”
and “different” that outperforms existing materials.
It is well known that DL models are highly susceptible to such distributional
shifts, which often lead to unintended and potentially harmful behavior
(i.e., overconfident predictions), especially when the models are trained with
an insufficient amount of data.[8] Therefore,
to ensure that the trained models behave reliably and to establish
DL as a dependable solution in high-stake Material Discovery applications,
the following equally important questions need to be answered.
How Much Data
Is Required to Train a DNN?
The quality
and amount of training data are often the most important factors that
determine the accuracy of a DL model. However, determining the amount
of data necessary to train a DNN to achieve high classification accuracy
is a challenging problem. A reasonable approach is to model validation
accuracy as a function of the amount of training data to predict the
optimal data size needed to achieve a certain classification performance
(referred to as the learning curve-based approach).[9] Unfortunately, to get a reliable estimate of validation
accuracy, this approach requires a large amount of labeled data not
used in the training process. This is clearly not feasible for the
application of interest where the labeling process (e.g., performing experiments
to collect relevant data) is expensive and time-consuming.

To overcome this challenge, we propose to leverage the DNN prediction
uncertainty (or confidence) information and approximate the validation
accuracy in the learning curve using unlabeled data only. Specifically,
we obtain the average confidence (i.e., the predicted probability
of the most plausible label) over the set of unlabeled data as the
estimated validation accuracy for generating the learning curve. We
show that having access to unlabeled scanning electron microscopy
(SEM) images is sufficient in our approach to determine with high
precision how many more material samples should be imaged and collected
to provide sufficient DL training data to achieve the desired accuracy
level.
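For concreteness, this confidence-based surrogate requires only the model's predictive probabilities on unlabeled images; the following minimal NumPy sketch (our illustration with hypothetical names, not the authors' released code) shows the idea:

```python
import numpy as np

def confidence_based_accuracy_estimate(prob_matrix):
    """Estimate validation accuracy without labels as the average confidence,
    i.e., the mean predicted probability of each sample's most plausible label.
    prob_matrix: array of shape (n_unlabeled_samples, n_classes)."""
    confidences = prob_matrix.max(axis=1)  # max softmax probability per sample
    return float(confidences.mean())
```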
How to Equip DNNs with a Reject Option?
The performance
and dependability of DNN classification models can be dramatically
improved by building in a reject option. Testing the DL model on difficult
(or ambiguous) material samples is a common case in material science
applications since the discovery of new materials often involves moving
away from existing materials using different synthesis and processing
conditions. In such a context, having a reject option allows DNN models
to refer a subset of difficult material samples for further evaluation
and testing while making predictions on the rest. This is also known
as selective classification where the classifier abstains from making
a decision when the model is not confident while keeping coverage
as high as possible. An added benefit of deploying selective classification
is that it has the potential to yield substantial improvements
in performance on the remaining material data. Unfortunately,
many off-the-shelf ML models are not well calibrated, i.e., the probability
associated with the predicted class label does not match the probability
of that prediction being correct.[10,11] It has been
observed that this issue is more noticeable in complex models such
as DNNs.[12] Intuitively, this observation
implies that DNNs are particularly bad at recognizing ambiguous samples.

To overcome this challenge, we propose to use state-of-the-art
uncertainty-aware DNNs[13,14] in Material Discovery workflows
as they are known to be better calibrated as compared to their baseline
counterparts. Specifically, we show that uncertainty-guided decision
referral can substantially improve the classification accuracy on
the nonreferred samples while reducing the number of referred (i.e.,
rejected) samples.
How to Make DNNs Recognize Out-of-Distribution
(OOD) Examples?
Test data in real-world DL applications usually
deviates from the
training data distribution, e.g., due to sample selection bias, nonstationarity,
and noise corruptions in some cases.[8] For
example, in the material science application (i.e., analysis of SEM
images), the deviation between train and test data may arise due to
(a) changes in the equipment used to capture the training material
samples or (b) changes in the material synthesis process. Having plentiful
training data that can potentially cover all possibilities and variance
is ideal, but in practice, that is rarely attainable. Thus, a desirable
feature in a Material Discovery workflow is for a model to be aware of,
and not be very confident on, test data that is very far (or different)
from the data used to train it. For example, a potentially novel material
that is different from all training data should result in a request
for expert help rather than misclassification into a known material
class as the detection of undiscovered material is in fact a task
of significant interest. In other words, we require DNNs to
be aware of cases when the data acquisition setup or synthesis conditions
are so different from those used during training that DL predictions
cannot be trusted. Unfortunately, DNNs often make overconfident misclassifications
in the presence of distributional shifts and out-of-distribution (OOD)
data.[15] Accurate predictive uncertainty is a highly desirable property
in such cases as it can help practitioners assess the true performance
and risks and decide whether the model predictions should (or should not)
be trusted.[16]

Similar to the second question, we propose to leverage the predictive
uncertainties produced by state-of-the-art uncertainty-aware DNNs to
overcome this challenge in Material Discovery workflows. Specifically,
we show that the predictive uncertainty of uncertainty-aware DNNs is
indicative enough to differentiate among in-distribution data, data
captured with different equipment, and data generated with different
synthesis conditions.

To put these questions into context, we consider
the problem of
differentiating materials based on their microstructures as observed
by their SEM images. This problem is formulated as a K-class classification
problem where each class corresponds to a specific material, with unique
characteristics determined by a unique set of synthesis parameters.

We use deep neural networks (DNNs) to determine whether DL models
can learn to differentiate materials purely based on their complex
microstructures, which are often challenging for human assessment when
samples have similar-looking microstructures but very different mechanical
test behaviors (Figure 1). By determining the accuracy of the DL models
along with prediction uncertainties, we may begin to understand which
microstructures lead to samples with desired performance metrics and
avoid extensive testing and evaluation for newly synthesized materials
with high-confidence predictions. By demonstrating that DL can featurize
microstructural information from SEM images and classify each material
with high accuracy, we can start to build confidence in DL models to
screen, optimize, and discover materials in a significantly shorter
amount of time compared to an Edisonian approach.
Figure 1
Schematic of a workflow for SEM image-based material classification task.
Main Contributions
Deep learning-based solutions for Material Informatics applications
have often been suggested without considering the important questions
listed in the previous section. Yet, techniques to ensure the
dependability of automated decisions are crucial for integrating
DL in Material Discovery workflows. The main contribution of this
paper is to show that uncertainty-aware DL is a unified solution
capable of answering all of these questions by leveraging the predictive
uncertainty of DNNs. We demonstrate the applicability of our technique
by using DL models to classify the microstructural differences of
a material based on the corresponding SEM images. The summary of
our findings is as follows.

First, we show that by leveraging predictive uncertainty, one can
estimate the classification accuracy at a given training sample size
without relying on labeled data. This serves as a general methodology
for determining the training data set size necessary to achieve a
certain target classification accuracy.

Next, we show that predictive uncertainty can be a guiding principle
to decide which material samples should be referred to a material
scientist for further testing and evaluation instead of making a
DL-based prediction. We find that this uncertainty-guided decision
referral can dramatically improve the classification accuracy on the
remaining (i.e., nonreferred) examples.

Finally, we show that predictive uncertainty can be used to detect
distributional changes in the test data. We find that this scheme is
accurate enough to detect a wide range of real-world shifts, e.g., due
to changes in the imaging instrument or changes in the synthesis
conditions.

Although we focus on a specific material application in this paper,
the proposed methodology is quite generic and can be used to make
the application of DL to a wide range of scientific domains dependable
and trustworthy.
Methods
Data Sets
Our
main data set is composed of SEM images
of 30 different lots of 2,4,6-triamino-1,3,5-trinitrobenzene (TATB).
Here, a lot refers to TATB crystals produced under a specified synthesis/processing
condition. TATB is an insensitive high-explosive compound of interest
for both the Department of Energy and Department of Defense.[17] After each lot has been synthesized (with different
synthesis conditions), each lot is analyzed with a Zeiss Sigma HD
VP SEM to produce high-resolution scanned images while holding image
acquisition conditions (e.g., brightness and contrast settings) fixed
across all lots/images. Each image tile consists of 1000 × 1000
pixels, with a corresponding field of view of 256.19 μm ×
256.19 μm. The combined images captured for the 30 lots resulted
in 59 690 grayscale SEM images; thus, the labeled data of interest
for the DNNs consists of 59 690 grayscale SEM images, each labeled
with a unique designator per class (30 classes in total). Example SEM
images of TATB for some classes are provided in Figure 2. One can notice
strong visual discrepancies across SEM images from different lots (or
classes).
Figure 2
Representative SEM images to illustrate the typical microstructural
variability for different TATB lots. The varying particle size, porosity,
polydispersity, and facetness can be clearly observed. Images have
been processed to normalize image contrast and brightness levels.
Reprinted with permission from ref (4) (CC BY 4.0).
Deep Learning Models
Let $x \in \mathcal{X}$ denote the SEM images and $y \in \{1, \ldots, K\}$ the K classes
of materials. We use $\{(x_i, y_i)\}_{i=1}^{N}$ to represent the N training
data points (pairs of images and labels). Before being fed into the
model, the SEM images are downsampled to a resolution of 64 × 64 and
the grayscale value of each pixel is normalized into the range [0, 1].
Given training and validation data sets, our goal is then to learn a
classifier that predicts the class of material samples in an unseen
test data set.

We trained the following vanilla and uncertainty-aware models:

Vanilla Softmax: our uncertainty-unaware baseline simply
regards the softmax outputs provided by the DNN as the predictive
probabilities. Unfortunately, high confidence softmax predictions
can be woefully incorrect and may fail to indicate when they are likely
mistaken[11] or when the input is OOD.[8]

Dropout: we use
Dropout,[18] a variational-inference-based
Bayesian uncertainty quantification
approach. During the training process, the Dropout DNN is trained to minimize
the Kullback–Leibler divergence between the approximating distribution (i.e., a
product of Bernoulli distributions across the DNN parameters) and the Bayesian
posterior over the DNN parameters. At inference time, Dropout predicts the
outputs by Monte Carlo sampling the network with randomly dropped-out units and
averaging the resulting predictions, which approximates integrating the predictive
likelihood over the posterior distribution. The simple and computationally
lightweight nature of Dropout, however, may yield approximate posteriors that are
inaccurate in some scenarios.[19−21]

Deep Ensemble: finally,
we include Deep Ensembles,[22] a practical,
scalable, and non-Bayesian uncertainty
quantification alternative for DNNs. As the name suggests, the core
idea is to train the DNN classifier in an identical manner (i.e.,
same model architecture, training data, and training procedure) multiple
times but each time with a random initialization of model parameters.
With T DNN classifiers (parameterized by θ_t, t = 1, ..., T) included in the
ensemble, the prediction probability vector for Deep Ensembles is the average
of the softmax vectors of the individual DNNs.

We use a wide residual network (WRN)[23] architecture due to its strong performance on
benchmark computer
vision data sets.[24] We apply a depth of
16 and a widening factor of 2, and the summary network structure is given
in Figure 3. Given
the WRN architecture, we train these DNN models with the cross entropy
loss function and the Adam optimizer.[25] The DNN model is trained from scratch (therefore, no pretraining).
We set the learning rate of 0.001 and decay the learning rate by half
every 50 epochs. We use a minibatch size of 64 and a weight decay
factor of 5e–4. Hyperparameters (including learning
rate, weight decay factor, as well as the network depth and width)
were determined through the HpBandster toolbox, an efficient tool
for hyperparameter optimization.[26] We used
200 epochs with an early stopping mechanism to terminate training when
the validation performance did not improve after 100 epochs. For Dropout,
we used a dropout rate of p = 0.3 and T = 16 dropout samples for inference. For Deep Ensembles, we train T = 16 models. All other hyperparameters are kept the same
as in the standard baseline case. No image preprocessing technique
was adopted in the training process. Horizontal flips were used for
training data augmentation. We randomly divide the labeled SEM image
data set into 80, 10, and 10% splits for training, validation, and
testing, respectively. The codes can be found online.
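As a concrete illustration of how the predictive probabilities of the two uncertainty-aware models can be obtained at inference time, the following PyTorch-style sketch (ours, not the released code; `model`, `models`, and `x` stand for a trained WRN, a list of T independently trained WRNs, and a batch of preprocessed image tensors, respectively) averages T stochastic or ensemble softmax outputs:

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, T=16):
    """Monte Carlo Dropout: average softmax outputs over T stochastic forward
    passes with dropout active at inference time. (In practice one typically
    keeps batch-normalization layers in eval mode and activates only dropout.)"""
    model.train()  # keep dropout units randomly dropped at inference
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])
    return probs.mean(dim=0)  # (batch_size, K) predictive probabilities

def deep_ensemble_predict(models, x):
    """Deep Ensembles: average the softmax vectors of independently trained models."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=1) for m in models])
    return probs.mean(dim=0)
```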
Figure 3
Wide ResNet architecture with a depth of 16 and a width of 2. The
notation (k × k, n) in the convolutional block and residual blocks denotes a
filter of size k and n channels. The dimensionality of outputs from each block
is also annotated. The detailed structure of the residual block is shown in the
dashed line box. Note that batch normalization and ReLU precede the convolution
layers and fully connected layer but are omitted in the figure for clarity.
Performance Evaluation Metrics
We evaluate the performance
of DNN models in terms of their (a) predictive performance and (b)
predictive uncertainty quality on the testing data set.

The predictive performance is measured using the Classification Accuracy
metric (i.e., the percentage of correct predictions among all data
points). On the other hand, following ref (27), we use the Shannon entropy[28]
as the metric to quantify the uncertainty contained in the prediction
probability vector $p = (p_1, \ldots, p_K)$:

$H(p) = -\sum_{k=1}^{K} p_k \log p_k$

Basically, it captures the average amount of information contained in
the prediction: $H(p)$ attains its maximum value when the classifier
prediction is purely uninformative (assigning all classes equal probability
1/K) and attains its minimum value when the classifier is absolutely
certain about its prediction (assigning zero probability to all but one
class). The quality of the predictive uncertainties is quantified using
the following metrics:

Negative log-likelihood (NLL): a standard metric of uncertainty quality
obtained by calculating the log of the joint probability of the predictions
on all $n$ test samples[29]

$\mathrm{NLL} = -\sum_{i=1}^{n} \log p(y_i \mid x_i)$

Lower NLL indicates better uncertainty quality.

Expected calibration error (ECE): calibration accounts for the degree
of consistency between the predictive probabilities and the empirical
accuracy. We adopt a popular calibration metric, ECE,[30] which measures
the average absolute discrepancy between the prediction confidence and
the accuracy

$\mathrm{ECE} = \sum_{j=1}^{N_b} \frac{|B_j|}{n} \left| \mathrm{acc}(B_j) - \mathrm{conf}(B_j) \right|$

where we sort the test data points according to their confidence (the
prediction probability for the most likely label, i.e., $\max_y p(y \mid x)$)
and bin them into $N_b$ quantiles. Here, $\mathrm{acc}(B_j)$ and $\mathrm{conf}(B_j)$
are the average accuracy and confidence of points in the jth bin,
respectively, and $|B_j|$ is the number of data points in that bin. We
use 20 equal-spaced bins to measure ECE in this paper. Lower ECE is
more favorable.
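These three quantities are straightforward to compute from the predictive probability matrix and the true labels; a minimal NumPy sketch (our illustration; the equal-width binning variant of ECE is shown) follows:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of the (n, K) probability matrix."""
    return -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)

def negative_log_likelihood(probs, labels):
    """NLL: negative log of the joint probability assigned to the true labels."""
    true_probs = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log(np.clip(true_probs, 1e-12, 1.0)))

def expected_calibration_error(probs, labels, n_bins=20):
    """ECE: bin-size-weighted average gap between accuracy and confidence."""
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += (in_bin.sum() / n) * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```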
Results and Discussion
In this section,
we first provide details on the DL models and
their performance on the in-distribution test data. Then, we discuss
and evaluate the performance of proposed solutions on the three aforementioned
use cases.
Performance Evaluation Result
On the test data set, Table 1 shows that Deep Ensembles
outperforms the rest of the methods in accuracy and ECE. Dropout achieves
the best NLL and also a much better ECE than the Softmax baseline.
Intriguingly, the two uncertainty quality metrics rank Deep Ensembles
differently (best ECE and worst NLL). This might be attributed to
the well-known pitfall of NLL, which overpenalizes samples
with very low prediction probabilities for their true classes.[14]
Thus, we recommend ECE as the default uncertainty quality metric to
avoid such a pitfall and also for its better interpretability
(accuracy vs confidence). Overall, these results demonstrate the effectiveness
of uncertainty-aware DNN approaches over the Softmax baseline. It
is important to remember that their performance gain over the baseline
is not free. In the case of Deep Ensembles, additional cost must be
spent on training and inference. For example, we trained an ensemble
of 16 classifiers independently, and the computational cost is 16
times higher than the baseline. For Dropout, although the training cost
is comparable to the baseline, the inference computation is similarly
expensive to that of Deep Ensembles.
Table 1
Accuracy and Uncertainty Quality of Different Methods(a)

Approaches | Accuracy (↑) | NLL (↓) | ECE (↓)
Baseline | 92.3% | 0.919 | 5.38%
Dropout | 92.1% | 0.903 | 2.79%
Deep Ensembles | 95.3% | 0.920 | 1.52%

(a) Bold values: P < 0.05.

The success of Deep Ensembles is anticipated. After all, ensembling
Ensembles is anticipated. After all, ensembling
multiple machine learning models is a well-known treatment to reduce
the error of a single model, which has been theoretically[31,32] and empirically[33,34] explained. Comparing to Dropout,
the other ensemble-like technique, we hypothesize that Deep Ensembles
learns multiple models whose predictions are much more diverse (lowly
correlated) given the high dimensionality and nonconvexity of DNN
parameter spaces, which is crucial for enhancing the classification
error and the uncertainty quantification quality.[35]
DL Trustworthiness Case Studies
In this section, we
show how uncertainty scores can be leveraged to answer the aforementioned
important problems to make the Material Discovery workflows dependable.
Case
Study 1: How Much Data Is Required to Train a DNN?
The need
for large amounts of labeled training data is often the
bottleneck to the successful deployment of machine learning models.
This is especially crucial for DNNs due to their overparameterized
nature.[36,37] Yet, in scientific applications, obtaining
high-fidelity labels can be expensive due to the associated costly
and time-consuming experiments.[38] Therefore,
an important issue is to decide how much training data is required
to achieve the desired accuracy level, which allows the user to prioritize
experimental plans accordingly.

Conventionally, this task is
done by generating the learning curve,[9] which approximately represents the relationship between the training
data amount and the validation accuracy on a set of labeled data unused
in training.[36] One can even further predict
the needed training data size to achieve the required accuracy by
extrapolating the learning curves.[39,40] However, a
drawback of such conventional validation-based learning curve approach
is its reliance on a large amount of labeled data unused in training
to accurately evaluate the validation accuracy, which can be expensive
or even infeasible in many applications. Here, we ask the question
of whether we can leverage the predictive uncertainty information
to solve this use case with access to unlabeled validation data (i.e.,
SEM images without material class labels) only. Specifically, we decide
to test if the average prediction confidence (which can be computed
without any labeled information) on the unlabeled data set can be
used as a surrogate to assess the validation accuracy and approximately
generate the learning curve. The logic behind such an approach is
that, as discussed in the previous subsection, for DNN models
with well-calibrated uncertainties (low ECE) the average confidence
should closely match the accuracy.

To examine the feasibility,
we train DNN classifiers with varying
amounts of training data (ranging from 10%, 20%, up to 100% of the maximum
available training data set size). We monitor the average validation
accuracy as well as the predicted accuracy based on confidence and
plot the corresponding learning curves in Figure 4. We see a significant difference between
curves for Softmax. Specifically, it overestimates the validation
accuracy and will result in underestimating the needed training data
amount. For Dropout, the two curves consistently stay close but seem
to be weakly correlated since the average confidence can sometimes
decrease while the validation accuracy keeps improving. This weak
correlation can be harmful in certain scenarios. For example, the
users might want to determine if the DNN performance continues to
improve as training data grows, and the Dropout-predicted learning
curves may lead them to wrong decisions. For Deep Ensembles, the predicted
learning curve not only closely matches the actual one but also shows
nearly identical trends.
Figure 4
Uncertainty-guided learning curves. The predicted
(confidence-based)
and actual learning curves using different approaches.
To summarize, these results show that uncertainty scores
from both
Dropout and Deep Ensembles can be leveraged to predict the required
amount of training data to achieve a certain validation accuracy without
having access to labeled validation data.
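The overall procedure can be summarized by the sketch below (our illustration only; `train_model` and `predict_proba` are hypothetical helpers standing in for the training and inference code described in the Methods section, and the data arrays are assumed to be available):

```python
import numpy as np

fractions = [0.1 * k for k in range(1, 11)]   # 10%, 20%, ..., 100% of training data
actual_curve, predicted_curve = [], []

for frac in fractions:
    n = int(frac * len(train_images))
    model = train_model(train_images[:n], train_labels[:n])   # hypothetical helper
    probs = predict_proba(model, val_images)                  # (n_val, K) probabilities
    actual_curve.append(float((probs.argmax(1) == val_labels).mean()))  # needs labels
    predicted_curve.append(float(probs.max(1).mean()))        # label-free confidence surrogate
```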
Case Study 2: How to Equip
DNNs with a Reject Option?
Next, we tested another practical use of predictive uncertainties:
identifying confusing samples so that the DNN can refrain from making
a prediction. Rather than making an erroneous decision, the classifier
declines to be trusted on such instances and refers these difficult
material samples for further testing and evaluation. This idea, formally
referred to as selective classification,[41,42] has been introduced
recently in the context of DNN classifiers.[43]

In this case study, we design the reject mechanism based on
the predictive entropy, where the user is allowed to reject the prediction
if the entropy of a DNN prediction exceeds a certain threshold. The
quality of uncertainty can then be reflected by the effectiveness
of the reject mechanism. To measure the performance, we adopted the
risk-coverage trade-off curve[42,43] of selective classification.
Here, the coverage refers to the ratio of data points for
which the classifier is confident enough (i.e., the predictive uncertainty
is lower than a given threshold), while the risk denotes the classification
error among such sufficiently confident points. The ideal goal is
to minimize the risk while maximizing the coverage.

We compared different approaches for selective classification based
on the risk-coverage trade-off curves in Figure 5. We see that the
trade-off curves based on the baseline Softmax and Dropout uncertainties
are nearly identical, while Deep Ensembles performed much better on this
task. For example, at the same coverage level of 90% (i.e., the classifier
rejects the 10% of instances that it is most uncertain about), Deep
Ensembles has around 1.5% classification error, much lower than the other
two approaches (3.5%). This further verifies the superior uncertainty
quality of Deep Ensembles and presents another practical benefit of
well-calibrated prediction uncertainties.
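For reference, a risk-coverage curve of this kind can be traced by sorting the test points from most to least confident and accumulating the error rate; a minimal sketch (ours, assuming per-sample entropies and correctness indicators are available) follows:

```python
import numpy as np

def risk_coverage_curve(entropies, correct):
    """Risk-coverage trade-off from predictive entropies.
    entropies: (n,) predictive entropy per test point (lower = more confident)
    correct:   (n,) boolean array, True where the prediction was correct."""
    order = np.argsort(entropies)                    # most confident points first
    errors = (~np.asarray(correct)[order]).astype(float)
    n = len(entropies)
    coverage = np.arange(1, n + 1) / n               # keep the k most confident points
    risk = np.cumsum(errors) / np.arange(1, n + 1)   # error rate among kept points
    return coverage, risk
```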
Figure 5
Uncertainty-guided
decision referral. Risk-coverage curves for
different DNN approaches.
To summarize, our results show that Deep Ensemble uncertainty-guided
decision referral can dramatically improve the classification accuracy
on the nonreferred material samples while maintaining a minimal fraction
of referred (rejected) material samples.
Case Study 3: How to Make
DNNs Recognize Out-of-Distribution
Examples?
In the real-world setting, DNNs often encounter
data collected under conditions different from those used in the
DNN training process. This can occur because of (a) changes in the
image acquisition conditions, (b) changes in the synthesis conditions
(e.g., discovery of a new material), or (c) unrelated data samples.
In such cases, it is crucial to have a detection mechanism to flag
such out-of-distribution (OOD) data points that are far away from
the training data’s distribution.

In this section, we
test the potential use of predictive uncertainties for detecting OOD
data points, with the underlying logic being that DNN models with
well-calibrated uncertainties should assign higher predictive uncertainties
to the OOD instances. We formulate the OOD detection problem as a
binary classification problem based on the predictive entropy and
quantify the performance of the corresponding OOD classifiers. The
OOD data are regarded as the positive class and the in-distribution
data as the negative class, and the OOD classifiers make decisions
solely based on the values of prediction entropy. We adopt the evaluation
metric in ref (44) and
measure the classification performance using the receiver operating
characteristic curve (ROC) and the area under the curve (AUC). The
ROC curve plots the false-positive rate (the probability of in-distribution
data being classified as OOD) versus the true positive rate (the probability
of OOD data being classified as OOD) for different thresholds on the
entropy. Therefore, the closer the ROC curve is to the upper left
corner (0.0, 1.0), the better the OOD classifier is,[45] while a totally uninformative classifier would exhibit
a diagonal ROC curve. The AUC provides a quantitative measure on the
ROC curves by measuring the areas under them. A higher (closer to
1) AUC value is better, while the uninformative classifier has an
AUC of 0.5.
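Concretely, the entropy-based OOD classifier and its ROC/AUC can be evaluated as in the following sketch (our illustration, using scikit-learn's roc_curve and roc_auc_score):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_ood_detector(entropy_in, entropy_ood):
    """Treat OOD as the positive class and use predictive entropy as the score:
    higher entropy should indicate a more likely OOD sample."""
    scores = np.concatenate([entropy_in, entropy_ood])
    labels = np.concatenate([np.zeros(len(entropy_in)), np.ones(len(entropy_ood))])
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, roc_auc_score(labels, scores)
```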
Detecting Changes in Image Acquisition Conditions
To
facilitate a more concrete understanding on the necessity of such
OOD detection mechanism, let us first examine how our DNN classifiers
perform on some real-world OOD SEM images. This particular OOD phenomenon
is caused by replacing the SEM filament. As a result, while the brightness
and contrast settings of SEM images were held constant before and
after the filament change, the newly collected images are different
from the data we used in training and testing (see Figure 6) for the same lots. This is
particularly applicable in an automated image collection workflow,
where image collection conditions are usually fixed without human
intervention. Such a type of OOD is commonly denoted as covariate
shift in the machine learning community.[46] We recorded the DNN
classification accuracy on the in-distribution (original filament) and
the OOD (replaced filament) SEM images from the same material lot in
Table 2. The gaps between in-distribution and OOD accuracy are
substantial, ranging from 13 to 19%. This highlights the risk of being
misled by the DNN’s erroneous predictions when
encountering such real-world OOD data.
Figure 6
Effect of changing filaments
on the SEM images while maintaining
fixed brightness and contrast acquisition settings from the same lot.
Table 2
In-Distribution and OOD Classification Accuracy for the AT Lot

Test accuracy | Softmax | Dropout | Deep Ensembles
In-distribution | 83.5% | 92.1% | 89.5%
OOD | 70.4% | 73.1% | 70.9%
Drop due to OOD | 13.1% | 19.0% | 18.6%
Now, we focus our attention on the problem of detecting data
generated using different image acquisition conditions than the training
data. In this experiment, we use 1000 SEM images from the existing
SEM image data set as the in-distribution data and obtain 1000 covariate
shift (replaced filament) images for the same material as the OOD
data. The ROC curves and AUC are shown in Figure 7. The classifiers based on both Softmax and
Dropout uncertainties performed poorly (flat ROC curves and low AUC
value), indicating that the OOD detection based on such uncertainties
will not work. This is due to the fact that the difference between
in- and out-of-distribution images is subtle in this experiment, making
the detection task very challenging. On the other hand, the OOD classifier
based on Deep Ensembles performed much better (visually from the ROC
curve or quantitatively from AUC). From a practical point of view,
it might be the only OOD detector capable of identifying a large number
of replaced filament images without triggering a high volume of false-positive
alarms. The superiority of Deep Ensembles is not completely surprising—it
aligns with some prior research[14] that
also identified Deep Ensembles as the best performer on covariate
shift data.
Figure 7
ROC curves and AUC for detecting covariate-shifted SEM material
images based on the predictive uncertainties from different approaches.
Detecting Changes in Synthesis Conditions
In this experiment,
we push the OOD data set further away from the training data distribution.
Specifically, the OOD data points are SEM images for some unseen classes
of the TATB crystal material, i.e., they do not belong to the 30 classes
in the training data set due to different manufacturing techniques
or postprocessing (i.e., grinding), which will produce very different
looking TATB crystals. OOD data from unseen classes will frequently
be encountered in realistic applications such as material discovery
due to synthesis condition changes. As seen from Figure 8, all examined approaches
achieve acceptable performance (AUC higher than 0.7), meaning that
they should be applicable to distinguish the SEM images from novel
material classes.
Figure 8
ROC curves and AUC for detecting SEM material images of
unseen
classes based on the predictive uncertainties from different approaches.
Detecting Unrelated Data
Finally,
we examine an extreme
case for OOD detection, where the OOD samples are truly far away from
(or unrelated to) the distribution of material SEM images. For this
case, we obtained OOD images from the CIFAR-10 natural image data
set (including 10 categories of images, such as cats, dogs, and birds).[47]
As the CIFAR-10 images are RGB color images at a lower (32 by 32)
resolution, grayscale conversion and upsampling were applied to convert
them into the format of the SEM images (64 by 64 grayscale). The
detection results are shown in Figure 9. Deep Ensembles
achieved a near-perfect detection result (AUC close to 1). Interestingly,
the simple Softmax baseline also performed better than Dropout, as
the latter only achieved a 0.42 AUC (worse than randomly guessing).
Although CIFAR-10 images are visually distinguishable from material
SEM images, the results show that it is still nontrivial to obtain
a good uncertainty-based OOD detector. The near-perfect performance
of Deep Ensembles is very impressive and validates the superiority
of its uncertainty estimates over Softmax and Dropout.
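The conversion of CIFAR-10 images into the SEM input format can be done, for instance, as in the sketch below (our illustration; the paper does not specify the exact interpolation used, so bilinear resampling is an assumption):

```python
import numpy as np
from PIL import Image

def cifar_to_sem_format(rgb_image):
    """Convert a 32x32 RGB CIFAR-10 image (uint8 array of shape (32, 32, 3))
    into the 64x64 grayscale, [0, 1]-normalized format used for the SEM data."""
    gray = Image.fromarray(rgb_image).convert("L")           # RGB -> grayscale
    gray = gray.resize((64, 64), resample=Image.BILINEAR)    # upsample 32x32 -> 64x64
    return np.asarray(gray, dtype=np.float32) / 255.0
```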
Figure 9
ROC curves and AUC for
detecting unrelated CIFAR-10 images based
on the predictive uncertainties from different approaches.
Can We Leverage Uncertainties to Identify Different Types of
Shifts?
In this section, we ask whether uncertainty-guided
OOD detection approaches can differentiate among different sources
of distribution shifts. This is an important feature to have as it
might inform users on what they should do with the OOD data, for example,
whether the user could utilize the OOD data to augment the existing training
data (the case of changing image acquisition conditions), conduct
further testing (the case of changing synthesis conditions), or simply
discard the data (the case of unrelated data).

To answer this question, we characterize the distribution of predictive
entropy for in-distribution and OOD data using histograms in Figures 10–12.
Intuitively, we expect the predictive
entropy of OOD data to be always higher (i.e., more uncertain) than
in-distribution ones. Furthermore, this discrepancy should become
more noticeable as the OOD data shifts away from the training data
distribution. However, from Figures 10 and 11, we observe that both
Softmax and Dropout are very confident (assigning low entropy values)
on their predictions for domain shift and CIFAR OOD data, although
they both performed reasonably well for Unseen-class data. On the
other hand, as seen in Figure 12, Deep Ensembles always produces higher predictive
entropy for all examined OOD data sets, and the gap between in-distribution
and OOD samples’ predictive entropy indeed becomes more apparent
with an increase in the amount of shift. In other words, Deep Ensembles
can differentiate among different sources of shifts.
Figure 10
Softmax: histogram comparisons
of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
Figure 11
Dropout: histogram comparisons of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
Figure 12
Deep Ensembles: histogram comparisons of the predictive entropy for the
in-distribution and out-of-distributions from various data sets.
To summarize, our results show
that uncertainties from Deep Ensembles
can be used to detect out-of-distribution samples. Further, their
uncertainties are able to differentiate the sources of distribution
shifts and hint toward what to do with the OOD data, e.g., using the
OOD data with changed image acquisition conditions in data augmentation,
conducting new mechanical testing after detecting the OOD data from
unseen classes of materials, or simply discarding the unrelated OOD
data.
Conclusions
In this work, we successfully
demonstrated the benefits, applicability,
and limitations of uncertainty-aware deep learning methods for making
material discovery workflows more dependable. Specifically, we showed
how uncertainty-guided methods can serve as a unified approach to
address several important issues in the examined material classification
problem. There are still some issues yet to be resolved for a successful
application of machine learning in Material Discovery workflows, but
leveraging uncertainties in DL models is a first step toward the
dependable implementation of DL models for material applications.