Francesca Tavazza1, Brian DeCost1, Kamal Choudhary1. 1. Materials Science and Engineering Division, National Institute of Standards and Technology, Gaithersburg, Maryland, 20899, United States.
Abstract
Uncertainty quantification in artificial intelligence (AI)-based predictions of material properties is of immense importance for the success and reliability of AI applications in materials science. While confidence intervals are commonly reported for machine learning (ML) models, prediction intervals, i.e., the evaluation of the uncertainty on each prediction, are not as frequently available. In this work, we compare three different approaches to obtain such individual uncertainty, testing them on 12 ML-physical properties. Specifically, we investigated using the quantile loss function, machine learning the prediction intervals directly, and using Gaussian processes. We identify each approach's advantages and disadvantages and end up slightly favoring the modeling of the individual uncertainties directly, as it is the easiest to fit and, in most of the cases, minimizes over- and underestimation of the predicted errors. All data for training and testing were taken from the publicly available JARVIS-DFT database, and the codes developed for computing the prediction intervals are available through the JARVIS-tools package. Not subject to U.S. Copyright. Published 2021 by American Chemical Society.
Uncertainty quantification in artificial intelligence (AI)-based predictions of material properties is of immense importance for the success and reliability of AI applications in materials science. While confidence intervals are commonly reported for machine learning (ML) models, prediction intervals, i.e., the evaluation of the uncertainty on each prediction, are not as frequently available. In this work, we compare three different approaches to obtain such individual uncertainty, testing them on 12 ML-physical properties. Specifically, we investigated using the quantile loss function, machine learning the prediction intervals directly, and using Gaussian processes. We identify each approach's advantages and disadvantages and end up slightly favoring the modeling of the individual uncertainties directly, as it is the easiest to fit and, in most of the cases, minimizes over- and underestimation of the predicted errors. All data for training and testing were taken from the publicly available JARVIS-DFT database, and the codes developed for computing the prediction intervals are available through the JARVIS-tools package. Not subject to U.S. Copyright. Published 2021 by American Chemical Society.
Artificial
intelligence (AI) approaches for materials science have
been explored for decades,[1−9] but only in the last few years, they have become consistently successful
across a wide variety of materials science tasks. While materials
science is still data-poor compared to many other application areas
of AI modeling, modern instruments and powerful computational facilities
are finally able to produce enough data for successful application
of machine learning (ML) in this area. The Materials Genome Initiative
(MGI) has also contributed to this flourishing, by supporting the
creation of large, publicly accessible databases, with very consistently
computed data. With respect to atomistic computational results, regression
models were first successfully applied to the evaluation of energetics-related
properties, like formation energy,[10−12] and then expanded to
predict elastic,[13−15] electronic,[5,16,17] optical,[18] and many other physical properties.
Computational parameters needed in simulations, not just physical
quantities, can be predicted using regression models as well such
as for k-points, as discussed in ref (19). JARVIS-ML[12] is an example of a publicly available suite
of ML models for a variety of physical properties. Specifically, all
its models apply to the prediction of properties for ideal, crystalline
materials and have been trained on the density functional theory (DFT)
data in the JARVIS-DFT database.[20,21]ML methods
are intrinsically statistical in nature, as the true
underlying model form is not available to the ML model. However, quite
surprisingly, during this period of explosion of AI applied to materials
science, relatively little effort has been spent in evaluating its
uncertainty, beyond assessing the average ability of the model. While
there is a fair amount of discussion on how to evaluate uncertainty
in ML/materials research,[22−24] on how to estimate prediction
intervals,[25−30] and on how to collect data to improve ML models,[31] the uncertainty on the individual prediction is often not
rigorously evaluated or reported when machine learning material properties.
Uncertainty quantification (UQ) is an extremely important step in
any experiment or computational assessments, as it determines how
trustworthy the measured or computed data are. The same is true for
the predictions of ML. Currently, the majority of ML/materials papers
exclusively provide uncertainty evaluation on the average ability
of ML models, providing quantities like the mean average error (MAE),
the mean square error (MSE), or the root-mean-square error (RMSE).
This approach fails to address the uncertainty/confidence of predictions
for individual instances. For instance, when applying an ML model
for formation energy to an unknown compound, we want to know if that
specific material will be stable, and an evaluation of the average
performance of the model over the entire dataset is not necessarily
significant. Moreover, because the MSE-type stats make assumptions
about the future data that we know we cannot satisfy (independent,
identically distributed data) in materials discovery and design settings,
only providing these quantities as quality evaluators may be deceiving
about the real capability of the ML model. The reason behind ML-UQ
focusing mostly on population variables is that determining prediction
intervals, i.e., the uncertainty on each specific prediction, can
be computationally fairly expensive, and usually, it requires an additional
computational effort beyond the training of the model. Gaussian processes[32] are an exception to this rule, as the individual
uncertainty is automatically determined when the model is fit. However,
they have other limitations that often prevent them from being the
approach of choice, as we will briefly discuss later in this work.
Obviously, no amount of UQ can make up/reveal bad quality in the training
data as systematic errors in the generation of the dataset will propagate
through a machine learning model. For example, standard DFT is known
to underestimate electronic bandgaps, so any ML model trained on such
data will also predict underestimated bandgaps.In this work,
we estimate prediction intervals using three straightforward
and easy-to-implement approaches, one that allows to choose which
population interval to be covered by the UQ (quantile approach), the
second, based on directly learning the error itself and applicable
to any ML algorithm, and the third that automatically provides the
uncertainty of each prediction when fitting the model (Gaussian processes).
Each method has its advantages and disadvantages, and they are the
focus of the Results/Discussion section. These
methods are applied to energetic, mechanical, electronic, and optical
properties, and their results across properties are compared to each
other. All the ML models in this work are trained and validated on
the same DFT data so that they can be compared in a completely consistent
way. Specifically, for both training and validation, we use the DFT
data contained in the JARVIS-DFT database.[20,21]
Methods
All ML models discussed in this paper
are regression models as
implemented in LightGBM[33] in the case of
gradient boosting decision tree (GBDT), or in scikit-learn,[34] for Gaussian processes (GP) and random forest
(RF). We use JARVIS-ML-based classical force-field inspired descriptors
(CFID)[12] when using GBDT, gradient boosting
and random forest, and a subset of such descriptors when using GP.
The CFID descriptors provide a complete set of structural and chemical
features (1557 for each material), and CFID-based models have been
successfully used to develop more than 25 highly accurate ML property
prediction models.[12]All the ML models
in this work are trained and validated on the
DFT data contained in the JARVIS-DFT database[20,21] (04.26.2020 version, https://doi.org/10.6084/m9.figshare.6815699). For each property, this allows for a consistent comparison between
methodologies. However, the dataset size depended on the property
under examination, as computing some properties is much more involved
than computing others. The largest datasets were available for formation
energy (35,885 materials), energy above the convex hull (34,880 materials),
and bandgaps computed using the OPTB88vdW functional (OPT) (35,932
materials). The smallest dataset is for exfoliation energies, counting
only 806 data. Table S3 gives the dataset
size for each property that was examined in this work. In addition
to the size of the dataset, its distribution is also an important
characteristic and is shown for some key properties in Figure S1.To determine prediction intervals,
the first methodology employed
in this work estimates the error through the evaluation of an upper
and a lower prediction bound for the quantity under examination. In
the rest of the paper, we will call this approach “quantile”
as it is based on using the quantile loss function. Machine learning
models are often fit through the minimization of a loss function[22−24] as such a function quantifies the errors that the model makes on
the training data. There are many options for such loss functions,
depending on factors like the distribution of the data, the choice
of the machine learning algorithm, the quality of the data, etc. Some
loss functions penalize low and high errors equally, and others do
not. The least-square loss function (least square, aka MSE) is a commonly
used regression loss among those that penalize low and high errors
equally, meaning that a positive and a negative error of the same
amount incur the same penalty. Obviously, errors of different amounts
are penalized differently. Specifically, when using MSE,optimal parameters are determined
by minimizing the sum of squared residuals. Here, yP are the values predicted for the target values y. In other words, MSE optimizes for the mean of the
distribution.Mean average error (MAE) is another example of
loss function that
penalizes low and high errors equally. MAE is less sensitive to outliers
than MSE, so depending on the problem, it may be a better choice.
Other loss functions penalize low and high errors unequally, therefore
allowing optimization by percentiles. The quantile loss function (quantile)
is part of this latter group, and what percentile it optimizes for
depends on the choice of the quantile parameter α[22]where α is between 0
and 1. Such a property is the base of the first method that we use
to estimate the prediction intervals, therefore the name “quantile”
for this approach. For each physical quantity that we are interested
in, we will independently optimize three ML models, one for αupper, one for αlower, and one for αmid, where αmid = 0.5 (i.e., the median).
In the rest of the paper, we will refer to these models as UPPER,
LOWER, and MID, respectively. The prediction interval for each predicted
data is given bywhere yupper (ylower) are the values predicted for y using
αupper (αlower) and yP is the value predicted for y using αmid. This will be
referred to as the “predicted value” in the rest of
the paper.An advantage of this method is that the population
interval to
be covered is arbitrary. Most of the results discussed in this work
have been obtained using αupper = 0.84 and αlower = 0.14 to construct the upper and lower bounds so that
our prediction interval would ideally cover 68% of the population,
corresponding to one standard deviation under the assumption of Gaussian
error distribution. This makes for an easy comparison with the other
methodologies examined in this paper. It is important to note, however,
that the coverage of 68% of the population (in the case of αupper = 0.84 and αlower = 0.14) would only
occur for perfect fits and that, in general, the direct count of how
often the reference value falls within the UPPER–LOWER interval
gives a better idea of the meaning of the uncertainty. We will refer
to this quantity as “inbounds” in the rest of the paper.Algorithm-wise, we utilize the gradient boosting decision tree
(GBDT) regression for the quantile approach because it allows optimization
of an arbitrary differentiable loss function so that the MID model
could be run using either quantile with αmid = 0.5
(i.e., optimizing for the median) or MSE (i.e., optimizing for the
mean). Another advantage of GBDT is that it provides some degree of
interpretability through feature importance ranking, a property that
we took advantage of when choosing descriptors for Gaussian processes.
As a standard practice, we use a 90 to 10% train–test split.[12,15,17] The 10% independent test set
is never used in the hyperparameter optimization or model training
so that when the model is evaluated on it, we obtain unbiased performance
metrics.The second approach that we use to determine prediction
intervals
uses a machine learning model to predict the errors directly. This
means that it requires the fitting of two ML models for each physical
quantity under examination: one to predict the actual value of the
property (“base” model) and one to determine the prediction
intervals (“error” model). Consequently, training the
models requires splitting the data into three groups: one to fit the
base model, one to fit the error model, which can also be used to
validate the base model, and the last one to validate the error model.
Obviously, the third set can be used to validate the base model as
well. Requiring splitting of the data into three sets may be a substantial
disadvantage when the amount of available data is small to begin with.
We chose to work with a 45–45–10 split so that the number
of samples in the third set of data matches the number of data used
to validate the other two methodologies. This makes the comparisons
between approaches more consistent. In the rest of the paper, this
approach will be referred to as “3split” because of
its requirement of splitting the available data into three groups.To model the error using ML, we explored two approaches: one where
for each data point, we predicted the absolute value of the difference
between exact and predicted values (Error_AD), and the other where
we predicted the square of such a difference (Error_SD) insteadThese two options were inspired by the comparison between
mean
absolute error (MAE) and mean square error (MSE) loss functions (L1
versus L2): using the square of the differences is easier to optimize,
but using the absolute value of such differences is more robust to
outliers. They will be indicated as “3split-L1” (eq ) and “3split-L2”
(eq ) in the rest of
the paper.An important advantage of this second approach is
that it does
not require a specific loss function, which means that it can be used
with any regression model. We used the GBDT regressor, for a direct
comparison to the quantile approach, and the random forest regressor,
as our preliminary tests showed it to be very effective even without
hyperparameter optimization. As for the quantile approach, we computed
the inbound percentage for each property model, to establish the meaning
of the determined uncertainty.The third approach that we investigated
uses Gaussian processes
so that individual uncertainties are automatically determined as the
model is trained.[32] Fitting a Gaussian
process (GP) starts with the choice of the covariance function, also
called kernel, so that the prior distribution is defined. A prior
is an infinite-dimensional multivariate Gaussian distribution completely
specified by a mean function, m(x), and a covariance function, k(x, x′)In other words,
the prior does not depend on the training data,
but it is a way to incorporate prior knowledge about the functions.
The next step is determining the posterior distribution (predictive),
which is obtained by limiting the functions going through the training
data. Because we have a Gaussian process prior, the posterior is a
Gaussian process as well: it is completely described by mean and covariance
and automatically provides the UQ estimate. The training of a GP model
involves the model selection (i.e., the choice of the kernel function)
and the optimization of the kernel hyperparameters. In this work,
all fitting parameters (λ for i = 1,..N, l, l′, l″, p, α, noise, and c, where i = 1, 2, and 3) are optimized using the Broyden–Fletcher–Goldfarb–Shanno[35] (L-BGFS-B) algorithm, as implemented in scipy.optimize.minimize.The main advantage of using GP is the fact that the uncertainty
on each prediction is determined together with the prediction value
itself, without requiring any extra effort. Gaussian processes also
are extremely effective with a smaller set of data and allow easy
incorporation of data into the model. However, GP have disadvantages
as well, especially the fact that they lose efficiency in high-dimensional
spaces, i.e., if the number of features exceeds a few dozens or for
large datasets. They are computationally memory-intensive for large
datasets as well.Lastly, like in the cases of quantile and
“3split”,
the meaning of the error on the individual prediction comes from knowing
how often the observed value (the DFT data in this work) falls into
the prediction interval. Because the posterior distribution of a Gaussian
process is a Gaussian itself, the meaning of the uncertainty on each
data should be straightforward. However, as in the case of quantile,
the fitting is never perfect, so it is important to compute its “inbound”
count on the validation set to establish its meaning for each fitted
GP model.
Results
All results discussed in this
session refer to test data. In all
cases, we performed a 90–10 split, so the test data were 10%
of the whole dataset.
Prediction Intervals: Quantile
Approach
When analyzing prediction intervals obtained using
the quantile
approach and eq , the
first characteristic that appears evident is their asymmetry with
respect to the predicted value yP, as shown in Figure a for a subset of
formation energy data. However, for plotting purposes and to be able
to directly compare quantile-given errors to prediction intervals
obtained using the other methodologies, in the rest of the paper,
we will define a quantile error aswhere PredInt(yP) is given by eq , and we will investigate yP ± errquantile.
Figure 1
(a) Quantile prediction intervals for
formation energy (enlargement).
The upper and lower predictions determine the size of the prediction
interval, while the “mid” prediction determines the
predicted value yP. The y = x line shows where the predicted value should
be to match the DFT data. The arrow points out a case where the predicted
intervals are not large enough to include the observed (DFT) data.
The elliptical enclosure shows an example of a predicted value outside
the prediction interval. (b) Formation energy data (enlargement) with
error bars given by errquantile = PredInt(yP)/2. The y = x line indicates
where the predicted values should be to match the DFT data.
(a) Quantile prediction intervals for
formation energy (enlargement).
The upper and lower predictions determine the size of the prediction
interval, while the “mid” prediction determines the
predicted value yP. The y = x line shows where the predicted value should
be to match the DFT data. The arrow points out a case where the predicted
intervals are not large enough to include the observed (DFT) data.
The elliptical enclosure shows an example of a predicted value outside
the prediction interval. (b) Formation energy data (enlargement) with
error bars given by errquantile = PredInt(yP)/2. The y = x line indicates
where the predicted values should be to match the DFT data.This approach is exemplified in Figure b, where the same predicted
data as in Figure a are displayed,
and the sizes of their predicted intervals are unchanged as well,
but now, the error bar is given by errquantile.Figure a also showcases
possible problems that come with using this approach: the arrow points
out a case where the predicted intervals are not large enough to include
the observed (DFT) data, while the elliptical enclosure shows an example
of a predicted value outside the prediction interval. In general, yP may not fall between yupper and ylower because all three values are predicted
independently, using models with different optimized hyperparameters
although identical training and validation sets. The same “problem”
points are pointed out in Figure b, but now, because of the symmetric definition of
the error, the second type of problem is eliminated by construction
(the ellipse case, in the figure). The possibility of having the observed
value (DFT data, indicated by the y = x line in Figure )
outside the prediction interval cannot be eliminated, and therefore,
it is necessary to compute the inbound percentage for each model,
using the validation set.Another point worth making, when discussing
how to use the quantile
approach for uncertainty evaluation, is that the predicted value yP can be determined using the least-square loss
function (MSE), instead of the quantile loss with αmid = 0.5. The use of MSE corresponds to driving the predictions toward
the mean of the distribution instead of its median, and it is the
approach followed in this work.We are now ready to examine
results obtained implementing the quantile
approach. In Figure , we compare two sets of models for each physical quantity of interest.
The first set was obtained independently optimizing the hyperparameters
of LOWER, MID, and UPPER, to achieve the lowest possible MAE in each
case, and it is indicated as the “best results” in the
figure. Figure c shows
that the inbound percentages for the “best results”
are always lower than the wished-for 68% (dashed line) and as low
as 40% in some cases. By choosing UPPER and LOWER models with slightly
larger MAEs, it is possible to increase the inbound counts. The new
results are labeled “higher inbound count”, and while
68% is not always obtained, the inbound counts are closer to it than
before. MAEs for base mid models are displayed in (a), but only for
those quantities for which the property prediction changed between
the two cases. Error MAEs for all quantities are reported in (b).
Figure 2
Comparison
between the lowest MAE models (“best results”)
and models with an inbound count closer to 68% (“higher inbound
count”), using quantile. MAEs for base mid models are shown
in (a), corresponding MAEs for error models in (b), and inbound counts
in (c). An example of the prediction distribution is given in (d),
for the bulk modulus in the “best results” case. The
color of each point represents the size of its prediction interval.
The dashed line in (c) indicates the 68% inbound counts that would
give easy interpretability to the error bars. Each MAE is in the units
of the corresponding property (bulk_mod and shear_mod = bulk and shear
modulus, respectively (GPa); e_hull = energy above the convex hull
(eV); op_gap and mbj_gap = bandgaps using OPTB88vdW and MBJ data,
respectively (eV); spillage, no units; epsx (mepsx) = refractive index
along x using OPTB88vdW and MBJ data, respectively
(no units); Max_ir = maximum infrared frequency (cm–1); n_Seeb = Seebeck coefficient for electrons (mV/K)).
Comparison
between the lowest MAE models (“best results”)
and models with an inbound count closer to 68% (“higher inbound
count”), using quantile. MAEs for base mid models are shown
in (a), corresponding MAEs for error models in (b), and inbound counts
in (c). An example of the prediction distribution is given in (d),
for the bulk modulus in the “best results” case. The
color of each point represents the size of its prediction interval.
The dashed line in (c) indicates the 68% inbound counts that would
give easy interpretability to the error bars. Each MAE is in the units
of the corresponding property (bulk_mod and shear_mod = bulk and shear
modulus, respectively (GPa); e_hull = energy above the convex hull
(eV); op_gap and mbj_gap = bandgaps using OPTB88vdW and MBJ data,
respectively (eV); spillage, no units; epsx (mepsx) = refractive index
along x using OPTB88vdW and MBJ data, respectively
(no units); Max_ir = maximum infrared frequency (cm–1); n_Seeb = Seebeck coefficient for electrons (mV/K)).
Prediction Intervals: ML Modeling of the Error
Directly
As the main advantage of modeling the error directly
is that any ML algorithm can be used, in Figure , we compare optimized gradient boosting
decision tree (GBDT) results to random forest (RF) ones, where RF
was run using default parameters only. The reason for such a comparison
is the fact that fitting ML models and, especially, optimizing their
hyperparameters are a computationally expensive endeavor. Adding UQ
on individual predictions necessarily increases such computational
cost, even significantly if ad hoc hyperparameter optimization is
needed. A possible work-around is to avoid using hyperparameter fitting,
at least for the UQ part. Such an approach is feasible only if an
algorithm is available, which produces acceptable MAE using default
hyperparameters. Our investigation identified random forest as such
an algorithm, with default parameters as in its scikit-learn implementation.
While default-RF results are clearly worse than those from optimized
GBDT, Figure shows
that their inbound percentage is like the one from GBDT 3split-L1
and that the error modeling is at least mostly qualitatively correct:
blue points are on or near the diagonal, and red and brown points
are away from it. However, the presence of more light blue or green
points on or very near the diagonal in RF results than in GBDT ones
tells us that RF has more cases of overestimation than GBDT.
Figure 3
Property and
error predictions using different algorithms: gradient
boosting decision tree (GBDT) with optimized hyperparameters and errors
defined either as the square or as the absolute value of the difference
between observed and predicted values (L2 or L1, respectively) and
random forest with default parameters. The error on each data point
is indicated by its color. Results are given for the bulk modulus
(top row) and formation energy (bottom row). For each case, MAEs for
both the property and error are reported, as well as their inbound
counts.
Property and
error predictions using different algorithms: gradient
boosting decision tree (GBDT) with optimized hyperparameters and errors
defined either as the square or as the absolute value of the difference
between observed and predicted values (L2 or L1, respectively) and
random forest with default parameters. The error on each data point
is indicated by its color. Results are given for the bulk modulus
(top row) and formation energy (bottom row). For each case, MAEs for
both the property and error are reported, as well as their inbound
counts.Figure also provides
a visual comparison between 3split-L1 and 3split-L2 results: in both
the bulk modulus and the formation energy cases, 3split-L1 has a lower
error MAE but also a lower inbound percentage than 3split-L2. Also,
less light blue-colored points are seen in its plots, especially over
or near the y = x line that indicates
perfect fit, possibly indicating not as much overestimation. Conversely,
more dark-blue points are seen away from the diagonal for 3split-L1
than for 3split-L2, indicating occasional underestimation of the errors.
Finding lower error MAEs and lower inbounds for 3split-L1 than for
3split-L2 is a trend common to all physical quantities that we examined.
More on how 3split-L1 and 3split-L2 models compare to each other and
to quantile and Gaussian process results will be covered in the Discussion
section (Section ). Lastly, it is well-known that prediction intervals are always
wider than confidence intervals,[4] so it
is no major concern that the errors are in many cases larger than
the corresponding MAEs.
Prediction Intervals: Gaussian
Processes
In this work, we limited the investigation of Gaussian
processes
(GP) to six quantities: the formation energy, exfoliation energy,
bulk modulus, bandgap obtained using the Tran–Blaha-modified
Becke–Johnson approach (MBJ bandgap), maximum infrared mode,
and spillage. Such quantities were chosen to cover a variety of different
physical properties, spanning from energetics to elastic, electronic,
optical, and topological while only probing some of the smaller datasets
available to us, as the GP becomes inefficient with larger sets. A
major exception to this criterion is the inclusion of formation energy,
as it comes with one of our largest datasets. It was important to
add it, though, because formation energy is a quantity systematically
investigated in ML works focused on nanoscale materials science, and
therefore, analyzing its single prediction UQ is of widespread interest.The choice of the kernel is crucial to successfully fit GP. The
specific kernel used in this work depends on the property of interest,
but they all were a combination of two or more of the following kernel
functions[36]whereRBF (radial-basis function kernel) is of the form ),
where d(x,x′)
is the Euclidean distance between x and x′
and λ is a vector of length-scale
parameters with as many dimensions as the number of descriptors (N).RQ (rational quadratic
kernel) is of the form , and it is parameterized by a length-scale
parameter l and a scale mixture parameter α.The ExpSineSquared kernel is of the form , where l″ is the
length-scale parameter and p is a periodicity parameter.The WhiteNoise kernel is equal to the noise
parameter
“noise” for x = x′,
and it is null otherwise.c1, c2, c3, and c4 are fitting constants
and can be set to zero.Through the RBF
term, we include the long-term, smooth rising behavior,
while the smaller, medium-term irregularities are made possible through
the RQ component, in the case that periodic trends need to be considered.
As the name hints, the WhiteNoise component of the kernel accounts
for white noise on the data. As our descriptors are very different
in nature, going from chemical ones to structural and coordination-related
ones, we found the importance to allow different length-scale parameters
for each of them, at least in the RDF term. All fitting parameters
(λ for i = 1,..N, l, l′, l″, p, α, noise, and c, where i = 1, 2, and 3) are
optimized using the Broyden–Fletcher–Goldfarb–Shanno
(L-BGFS-B) algorithm.[37]CFID descriptors
provide a complete set of structural and chemical
features through 1557 descriptors for each material. Although 1557
is not an unreasonable number of descriptors/features for most of
the ML models, they are too many to be effectively dealt with using
Gaussian processes. We used the “feature importance”
capability of gradient boosting methodologies to reduce their number
to something manageable, between 15 and 30, depending on the property.
In Figure a, data
leading to our choice of which descriptors to use for each property
are shown. Out of the 50 most important descriptors, the majority
are chemical descriptors, for all properties but optical. After the
chemical ones, either those distance-related or coordination-related
are the most common ones. The error bars come from averaging importance
results obtained in the three most successful GBDT models, for each
property. There is some variability, among runs, on which the descriptor,
in each category, is among the 50 most important. Therefore, the intersection
of the 50 most important descriptors between the three best models,
for each property, gave us a manageable number of descriptors to use
when fitting Gaussian processes. Using a combination of the features,
instead of a down selection of them, is an alternative approach that
loses some of the direct interpretability of each feature but that
may possibly lead to tighter inbound predictions.
Figure 4
(a) Property-dependent
descriptor importance analysis: for each
property, the 50 most important descriptors are classified into their
type (chemical, distance-related (Rdf = radial distribution function),
angle-related (Adf = angular distribution function), dihedral angle-related
(Ddf = dihedral distribution function), and coordination-related (number
of neighbors)). (b) Prediction versus observed (DFT) data for the
bulk modulus. The color gives the predicted error on each individual
prediction, using the same scale as in Figures and 3, to facilitate
comparisons.
(a) Property-dependent
descriptor importance analysis: for each
property, the 50 most important descriptors are classified into their
type (chemical, distance-related (Rdf = radial distribution function),
angle-related (Adf = angular distribution function), dihedral angle-related
(Ddf = dihedral distribution function), and coordination-related (number
of neighbors)). (b) Prediction versus observed (DFT) data for the
bulk modulus. The color gives the predicted error on each individual
prediction, using the same scale as in Figures and 3, to facilitate
comparisons.Figure b shows
our results in the case of the bulk modulus. Compared to quantile
(Figure ) and “3split”
(Figure ) results,
the GP predicts less “very accurate” errors (dark blue,
less than 2 GPa error), many more “accurate” ones (medium
blue, 2 to 10 GPa error), and almost none with error larger than 17
GPa (green or above). This behavior, a reduced dispersion in the predicted
error, occurs consistently in all six investigated properties and
always corresponds to a very high inbound percentage (usually around
80%).
Prediction Intervals: Discussion
After having explored individual results from each of our modeling
approaches, we are finally ready to quantitatively compare them across
all the properties. Figure a shows error MAE results, for all four approaches, having
chosen models with low MAEs and inbound percentages closest to 68%.
Having a comparable inbound count is necessary to get an accurate
MAE comparison. However, as displayed in Figure c, no matter how much time and focus we put
in the fits, we could not obtain models with exactly 68% inbound percentages
(one standard deviation). Again, fitting for lower MAEs and higher
inbound counts at the same time is not straightforward, as the hyperparameter
search is driven only by the minimization of the loss function. Figure c shows that the
GP inbound count is systematically around 80%, 3split-L2 between 70
and 80%, while 3split-L1 is consistently around 60%. The only exception
is exfoliation energy, which is the quantity with an incredibly small
dataset (only about 600 data points), so its results are often unpredictable.
Quantile has the largest spread in its inbound percentages, from just
above 50 to 70%. In terms of MAEs for the error models, quantile and
3split-L1 are consistently lower than 3split-L2 and GP. The systematic
difference between 3split-L1 and 3split-L2 hints to the importance
of how the error is modeled (in 3split-L1, the absolute value of the
error is machine-learned directly, while in 3split-L2, it is the absolute
value of the error to be machine-learned directly). All MAEs and inbound
counts discussed here (error models) are given in Table S1 in the Supporting Information (SI), together with
the corresponding MAEs for the base models (i.e., the models predicting
the physical quantities themselves).
Figure 5
Population variables for the four error
models investigated in
this work: MAEs in (a) and inbound percentages in (b). The same quantities
for rescaled error models in (c) and (d). As the inbound percentages
become very similar among models, the corresponding MAEs get closer
as well, with 3split-L1 being the lowest in most of the properties.
To facilitate comparison, we kept the y-axis scale
the same between the error model as-fitted cases (a) and the rescaled
ones (c). Each MAE is in the units of the corresponding property,
and details on the properties and their units are given in the caption
of Figure .
Population variables for the four error
models investigated in
this work: MAEs in (a) and inbound percentages in (b). The same quantities
for rescaled error models in (c) and (d). As the inbound percentages
become very similar among models, the corresponding MAEs get closer
as well, with 3split-L1 being the lowest in most of the properties.
To facilitate comparison, we kept the y-axis scale
the same between the error model as-fitted cases (a) and the rescaled
ones (c). Each MAE is in the units of the corresponding property,
and details on the properties and their units are given in the caption
of Figure .As an accurate comparison of MAEs for the four
methodologies is
only possible for very similar inbound counts, we rescaled the prediction
intervals, to obtain an inbound percentage of 68%. This was accomplished
by multiplying all validation prediction intervals, for each physical
property and error modeling approach, by a fitted factor, which is
chosen so that the inbound percentage is 68% or as close to it as
possible. Such factors are approach- and property-dependent, and they
are listed in Table S2 in the SI. The error
model MAEs are then recomputed using the scaled validation sample
values. Results are given in Figure . By construction, all rescaled inbound counts are
very close to 68% (Figure d). The corresponding MAEs (Figure c) are closer together than for the raw (not
rescaled) case (Figure a), highlighting how important it is to compare prediction interval-determining
methodologies only when the inbound counts are similar. For certain
properties, all models give essentially the same MAE (MBJ bandgap,
for instance), while in other cases, like for the maximum predicted
infrared mode (Max IR), there is still a substantial difference. Overall,
it is 3split-L1 to achieve the lowest error MAEs across most of the
properties followed by GP. As the “3split” models are
the easiest/fastest to fit, this is a good finding. Quantile does
not always work as well as it could have been expected: in the case
of two properties (OPTB88vdW(OPT)-bandgap and Max IR), it gives the
highest MAEs for the error model.MAEs give important information
on how a model behaves but not
completely. In Figure , the residuals for the base property (top row) and for the error
prediction (bottom row) are displayed, in the case of the bulk modulus.
Similar results are found for all the other properties investigated
in this work. In the top row images, the data point color indicates
the predicted error (i.e., prediction interval). These results show
that while all the approaches are reasonably accurate both as a base
(most of the data fall between +10 and −10 GPa) and as an error
model (most of the blue data points fall between +10 and −10GPa,
and nonblue data fall outside this range), the dispersion of the error
distribution depends on the modeling approach. In other words, how
many blue points actually fall outside the ±10 GPa range and,
conversely, how many nonblue points fall inside such a range depend
on how the prediction interval is determined. To directly investigate
the uncertainty in the error prediction, we plot the residual histogram
for the prediction interval models (Figure , bottom row). A strong dependence on the
modeling approach is evident from these plots, larger than what the
similarities among MAEs could have led us to expect, as the height,
range, and symmetry of the residual distribution vary significantly
among error models. Once again, 3split-L1 produces the best results,
as its error–residual histogram is the most symmetric, is centered
very close to zero, and is the tallest and the one with the smallest
full-width-half-maximum. As the base model is the same among 3split-L1
and 3split-L2, all the differences between these two cases are to
be attributed to the different definitions of “error”.
As already observed when commenting on predicted values versus DFT
ones for GP (Figure b), the GP is characterized by getting a smaller spread in the predicted
uncertainties than the other models, which leads to overestimating
the error of well-predicted data and, in a few cases, to underestimate
the error of badly predicted ones. This is evidenced by its histogram
peak not being centered around zero, as very few prediction intervals
are predicted to be almost zero, even when the corresponding property
values are very close to the exact ones. Conversely, not many large
overpredictions occur, as made evident by the sharp dropping off of
the histogram after 10 GPa.
Figure 6
Top row: residual for the base model (predicted
property –
DFT value) versus the predicted property, in the case of the bulk
modulus, for the four examined approaches. The data point color indicates
the predicted error (prediction interval). The distribution of predicted
intervals depends on the modeling approach. Bottom row: histograms
for the error model residual (predicted error – exact error).
An even stronger dependence on the modeling approach is evident here,
as the height, range, and symmetry of the residual distribution vary
significantly among error models.
Top row: residual for the base model (predicted
property –
DFT value) versus the predicted property, in the case of the bulk
modulus, for the four examined approaches. The data point color indicates
the predicted error (prediction interval). The distribution of predicted
intervals depends on the modeling approach. Bottom row: histograms
for the error model residual (predicted error – exact error).
An even stronger dependence on the modeling approach is evident here,
as the height, range, and symmetry of the residual distribution vary
significantly among error models.
Conclusions
In this work, we compare three
different approaches to determine
the uncertainty on individual machine learning predictions (prediction
intervals). Specifically, we probe the quantile loss function approach,
used machine learning of the error directly, where the error is defined
either as the absolute difference (3split-L1) or the square of the
difference (3split-L2) between the predicted and the observed values,
and used Gaussian processes. All approaches are applied to the modeling
of 12 physical properties, ranging from energetics-related ones to
elastic, optical, electronic, and more. This investigation is necessary
because quantities like MAE only evaluate ML models in a statistical
manner, and while users of such models need to know how reliable the
specific prediction, they are interested in this. We identify each
approach’s advantages and disadvantages, and we learned that
the descriptor choice matters. With good descriptors (like the CFID),
even ML algorithms with default hyperparameters (random forest, in
our investigation) give acceptable results. The choice of hyperparameters
is particularly crucial when developing Gaussian process models. We
find that Gaussian processes give a good estimate for prediction intervals,
although a bit overestimated, but are more time-consuming to fit than
the other approaches. Using the quantile approach requires fitting
three models and, in general, gives lower inbound results than Gaussian
processes or 3split-L2. However, among the three models, it is the
easiest approach to fit for specific inbound results. Machine learning
the error directly has the enormous advantage of allowing the use
of any loss function. However, it requires splitting the dataset into
three parts, which could be a problem if the dataset is small to begin
with. How the error is modeled also makes a difference, and using
the absolute difference between predicted and expected (3split-L1)
values was found to be, overall, the best approach. All data for training
and testing were taken from the publicly available JARVIS-DFT database,
and the codes developed for computing the prediction intervals are
available through JARVIS-tools github (https://github.com/usnistgov/jarvis).
Authors: G Pilania; A Mannodi-Kanakkithodi; B P Uberuaga; R Ramprasad; J E Gubernatis; T Lookman Journal: Sci Rep Date: 2016-01-19 Impact factor: 4.379