Seyed Mohamad Moosavi1, Kevin Maik Jablonka1, Berend Smit1. 1. Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL), Rue de l'Industrie 17, CH-1951 Sion, Valais, Switzerland.
Abstract
Developing algorithmic approaches for the rational design and discovery of materials can enable us to systematically find novel materials, which can have huge technological and social impact. However, such rational design requires a holistic perspective over the full multistage design process, which involves exploring immense materials spaces, their properties, and process design and engineering as well as a techno-economic assessment. The complexity of exploring all of these options using conventional scientific approaches seems intractable. Instead, novel tools from the field of machine learning can potentially solve some of our challenges on the way to rational materials design. Here we review some of the chief advancements of these methods and their applications in rational materials design, followed by a discussion on some of the main challenges and opportunities we currently face together with our perspective on the future of rational materials design and discovery.
Over the last few decades, materials chemistry
research has shifted
toward more rational design. There are now many examples, such as
metal–organic frameworks (MOFs),[1] polymers,[2] and DNA nanostructures,[3] in which we have such control over the chemistry
that we can move away from the traditional trial and error. If we
think about rational design, we quickly realize that we are dealing
with large numbers: a large number of materials, a large number of
applications, and a large number of options. Here we argue that the
conventional scientific approach for materials design based on fundamental
laws, computational modeling, and experimentation is challenged when
encountering these large numbers. Therefore, we are now developing
new tools that work on the basis of large data, which might allow
us to overcome some of these challenges. The development of modern
big-data science methodologies, often called machine learning, will
allow us to pursue our aim of understanding and designing materials
in a new way.

Machine learning models try to use the underlying
patterns and
relationships in data to make new predictions. A classical example
is image recognition. We know that there exists a complex relationship
between the pixels of an image and its label (e.g., dog or cat).
However, trying to find this complex relationship by writing an equation
that takes the images as an input and outputs whether it is an image
of a cat or dog is not only extremely challenging but also does not
lead to a scalable solution, as we need to develop a new equation
for every label. Instead, machine learning methods try to infer this
mapping between pixels and labels from observing many examples. The
underlying idea of machine learning is to use a training set of many
images of all different types of cats and dogs to infer the underlying
patterns in order to predict whether an unseen image shows a dog or
cat. Similarly, if presented with enough examples, machine learning
methods can extract relationships between chemical systems and their
properties and performances that would otherwise require solving equations
that are too complex.

In this Perspective, we do not deal with
the question of “how”
to do machine learning. We refer the readers to the comprehensive
review articles and books on the topic of how to implement a machine
learning project. Specifically, excellent resources are available for
overviews of the fundamentals of machine learning[4−6] and deep learning[7−10] and their applications in materials design,[11] chemical synthesis,[12−14] and molecular simulation[15] for different classes of materials or applications, e.g., battery
materials[16] and nanoporous materials.[17]

Instead, we focus on challenges in the
rational design and discovery
of materials and how we can use machine learning to address them.
Here we use as an example a topic of our research: materials for energy-related
applications. We aim to systematically design or discover materials
that lead to the optimal energy efficiency for any given application.
Typically, we are given some external parameters or constraints that
define the problem, for example, the source and sink for carbon capture,
the operational pressure for methane and hydrogen storage, or the
light spectral distribution and irradiance for solar cells. To systematically
find the optimal solution, we typically follow a multistage design
process (Figure 1)
that starts with identifying the materials search space, followed
by predicting or measuring materials properties and evaluating or
testing their performance for the target application. We then aim
to search the design space using the knowledge we obtain from our
observations to find the best-performing setup, considering materials
and process. Practically, solving “exactly” the governing
equations of the physical laws for all of these stages is far too
complex; therefore, we need fundamentally different approaches to
be able to deliver solutions for the technological challenges
of our time.
Figure 1
Algorithmic approach for holistic rational material design.
We
start with a problem for which we have conceptualized a solution (e.g.,
an adsorption process or device) with some requirements. For this
concept, we try to find the best materials in the materials search
space to maximize the performance indicators. Because of the complexity
of performance evaluation, we usually select surrogate parameters
(e.g., material properties) that we hope are reasonable surrogates
for the performance in the real world.

The use of machine learning in materials design and discovery is
a natural consequence of the problem we try to solve: finding needles
in a haystack of materials for any given application. The complexity
of this search requires extracting patterns in the form of design
rules and/or fast methods for exploring these vast spaces for optimization
of processes and materials properties. One can perceive that here
machine learning is not the object of the research but the tool for
doing better science. Doubtless, our research on materials has progressed
so much that we can produce large amounts of high-quality data, which
make the field of materials science ripe for rapid growth if the right
tools are used.

This new approach brings
us to a new era of materials design and discovery, one that intrinsically
has new opportunities and challenges. We can now effortlessly solve
some of the classic problems, and as a consequence, we can focus on
more challenging and sometimes new problems. In this Perspective,
we review some of the major advancements in materials understanding
and design that have been made possible by machine learning. We argue
that current progress in machine learning for materials science has
been stunning but that breakthroughs are still to come. Hence, we
discuss some of the challenges to overcome and our vision for the
future direction of research on this topic.
Materials Design and Discovery
Chemical Space and Databases
Often, finding an optimal
material for a given application is presented as inverse design. One
starts with the application and systematically narrows down the options
to find the holy grail of materials design, the optimal material to
synthesize. In the Introduction, we argued
that such an approach will be fundamentally limited if we always rely
on solving equations, either because of their complexity or our limited
understanding of how to simulate certain phenomena. Therefore, we
explore here the other extreme, which is using machine learning to
infer solutions on the basis of many observations and guide us through
the design process. Since this approach is based on having many observations,
our starting point is collecting a large amount of data, which can
be observations from our materials search space, training data for
our models, or simply the data with which we compare our findings.
Typically, such large amounts of data are accessible from large databases
of materials and properties.

In the past decades, crystal structures
of synthesized compounds have been collected in several databases,
including the Inorganic Crystal Structure Database (ICSD),[18] the Crystallography Open Database (COD),[19] the International Centre for Diffraction Data
(ICDD),[20] and the Cambridge Structural
Database (CSD).[21] In parallel, significant
progress has been made in the development of databases of hypothetical
materials, i.e., structures generated in silico, making it possible
to study materials even before they have ever been synthesized. For
instance, in the field of nanoporous materials, this effort has led
to the development of several hypothetical databases[22] of metal–organic frameworks (MOFs),[23−27] covalent organic frameworks (COFs),[28,29] and zeolites.[30,31] Altogether, these databases contain millions of chemical compounds.

These databases have constituted the starting point for computational
high-throughput screening studies. The computational predictions of
these studies have been compiled in repositories and databases that
are mainly focused on managing materials properties data. For instance,
the Materials Project,[32] the Pauling File,[33] Novel Materials Discovery (NOMAD),[34] and Materials Cloud[35] contain computational materials data. Additionally, databases like
the ones from the National Institute of Standards and Technology (NIST)
store experimental properties of materials, including adsorption properties
of porous materials[36] and thermophysical
properties of alloy materials.[37,38]

The sizes of these
databases sound enormous, yet they represent
only a fraction of all possible chemical structures. Since we rely
on observations to infer solutions, a primary step is to make sure
that we have sufficient diversity in our observations in a database.
The underlying fundamental question is how to make sense of the structure
of a database without knowing the relation between the material and
the performance. In machine learning this question is studied using
unsupervised learning, which deals with unlabeled data to infer patterns,
for example, classifying materials into different clusters or detecting
outliers in a database. We can use unsupervised learning to visualize
the current chemical space and analyze the underlying distribution
of databases.[39,40]

To illustrate this, we
can look at an example of unsupervised learning
on MOF databases, several of which are often used as the starting
point for materials discovery via high-throughput computational screenings
or machine learning. For this reason, it is important to understand
how well those databases cover the chemical space of MOFs and how
redundant they are.[41] The first step is
to map MOF structures onto descriptors to quantify their similarities.[17,42,43] Moosavi et al.[41] used an approach that closely follows the MOF chemistry,
in which a MOF is described by four sets of descriptors, for the metal
nodes, linkers, functional groups, and pore geometry. The idea is
that chemically similar MOFs have similar descriptors, which allows
us to quantify this similarity as a distance in descriptor space.
One has to note that similarity depends on the application: If we
are interested in gas separation, all nonporous MOFs are useless,
and hence, for that application they are all the same. However, if
we look at optical properties, pore shape is most likely not very
relevant.

For an application for which porosity is important, Figure 2 compares how
the characteristics of pore size and shape are covered by the different
databases. While one would like to have a database that has representative
samples of all possible geometries, we note a clear difference in
the distributions of the geometric properties of the databases. For
example, experimental MOFs (CoRE-2019 database[44]) are mainly in the small-pore region, and in silico ones
(ToBaCCo database[25]) are mainly in the
large-pore region. Indeed, if for a particular application a specific
type of pore is desired, the chance of discovering such a material
is different in each database. Similar analysis of the chemistry of
materials showed a lack of diversity in the metal chemistry of the
hypothetical databases. Notably, a lack of diversity can lead to wrong
conclusions. For example, Moosavi et al. showed that the importance
of metals for carbon capture was underrated in past studies because
of the lack of metal diversity in the analyzed databases. In addition,
once we have carried out such an analysis, we can also see whether
a new material has a well-known chemistry or pore geometry or whether
this material is completely new. As we are dealing with over 90 000
experimental structures, answering this question without such a big-data
approach to quantify the similarities of materials is difficult.[41,45−47]
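To make this concrete, a map like the one in Figure 2 can be produced with standard tools. The following is a minimal sketch using scikit-learn; the descriptor matrix is a random placeholder for real pore-geometry descriptors computed elsewhere:

```python
# Minimal sketch: project pore-geometry descriptors to 2D with t-SNE.
# `descriptors` is a placeholder for an (n_materials, n_features) array,
# e.g., pore diameters, surface area, and void fraction.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 6))

X = StandardScaler().fit_transform(descriptors)  # put features on one scale
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# embedding[:, 0] and embedding[:, 1] are the 2D map coordinates; materials
# with similar descriptors land close together and can be colored by database.
```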
Figure 2
Maps of the pore geometry of MOFs. The descriptors of
pore geometry
of MOFs were mapped to two dimensions using the t-distributed stochastic neighbor embedding (t-SNE) method. The t-SNE method preserves local similarities such that materials
similar to each other are located close to each other in the 2D map.
Each dot shows one material, and the structures from different databases
are overlaid on top of the collective map structures from all databases.
The experimental structures are from the CoRE-2019 database,[44] and the hypothetical structures are from the
ToBaCCo[25] and BW-DB[24] databases. From ref (41). CC BY 4.0.

At this point, it is
clear that a systematic data management plan
is indispensable. Such a plan must cover the full spectrum of data management
and curation from discovery to integration and cleaning of data. Interestingly,
we can use machine learning methods for data management and curation.
For example, we lack tabulated data for many interesting applications
and properties. While large amounts of data and scientific knowledge
are available through the literature, the challenge is to discover
and transform such raw, unstructured data embedded in text into contextualized
and structured data. In the context of data discovery and mining,
machine learning methods from natural language processing (NLP) can
help us to extract data from the literature. For example, to address
the lack of data for material synthesis, Kim et al.[48,49] performed text mining of more than 640 000 articles and provided
a data set of synthesis parameters for 30 different oxide materials
in a structured data format.

An outstanding example is a recent
work showing that a method called
word embedding can encode knowledge from past publications.[50] Word embeddings are representations of words
as high-dimensional vectors such that they preserve relationships
between words. For example, the distance between similar words (e.g.,
“cathode” and “battery”) will be smaller
in the word embedding space than the distance between dissimilar words
(e.g., “ascorbic acid” and “battery”).
Tshitoyan et al.[50] analyzed 3.3 million
abstracts of materials-science-related articles, containing around
500 000 words, to develop a word embedding that preserves
the contextual proximity of words. Remarkably, this
word embedding can capture materials science concepts such as the
periodic table and structure–property relationships. For example,
they used the word embedding to make predictions of new thermoelectric
materials (Figure 3). To find potential materials for thermoelectric applications, they
investigated the proximity of the word “thermoelectric”
in the word embedding space. A density functional theory prediction
of the properties of the materials that were found in this area is
shown in Figure 3b.
The word embedding not only recovered known thermoelectric materials
but also discovered several new promising candidates. Interestingly,
similar to chemists, the model used common chemical knowledge and
intuition, such as similarities in crystal structure or applications,
or phrases that describe materials properties for the predictions
(see Figure 3c for
a depiction of how three of the new potential thermoelectric materials
are connected to the word “thermoelectric”). Indeed,
dealing with millions of articles to develop such a comprehensive
view over the chemical literature is a difficult task to address without
machine learning.
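As a rough sketch of the machinery behind such a study, the snippet below trains a word embedding with gensim and queries it for neighbors of "thermoelectric"; the two toy token lists are placeholders for millions of preprocessed abstracts:

```python
# Minimal sketch: learn word embeddings from tokenized abstracts and query
# for terms that appear in contexts similar to "thermoelectric".
from gensim.models import Word2Vec

abstracts = [
    ["Bi2Te3", "is", "a", "well-known", "thermoelectric", "material"],
    ["the", "thermoelectric", "power", "factor", "of", "PbTe", "is", "high"],
]  # in practice: millions of preprocessed abstracts

model = Word2Vec(abstracts, vector_size=200, window=8, min_count=1, sg=1)

# Rank words by cosine similarity to the query; with a real corpus,
# candidate materials appear close to domain terms such as "thermoelectric".
print(model.wv.most_similar("thermoelectric", topn=5))
```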
Figure 3
Prediction of new materials for thermoelectric applications
using
data mining of the literature. (a) Materials that are found close
to the word “thermoelectric” in the word-embedding space.
(b) The power factors of the materials were computed using density
functional theory, resulting in the discovery of many new potential
materials for the thermoelectric applications. (c) Connecting words
between the newly discovered materials and the word “thermoelectric”.
The figure was redrawn based on ref (50).

Besides data discovery,
data curation and cleaning can potentially
benefit from machine learning. Specifically, we can exploit the statistical
nature of machine learning methods to clean the input data itself.
Since machine learning models infer the underlying pattern and relationships
from many examples, we can use them to identify anomalous cases, i.e.,
suspicious data points that are different from the majority of similar
data points. In structural and materials properties databases, various
kinds of errors might occur, such as wrong units for properties, spelling
mistakes, data transfer and storage issues, or duplicate structures.
An illustrative example is a recent work on the oxidation states of
MOFs. The oxidation states of metal centers are determined and reported
by chemists for the materials in the CSD. Jablonka et al.[51] developed machine learning models trained on
this collection of knowledge that can predict the oxidation states
with high accuracy. Coupling uncertainty metrics with these predictions,
they were able to identify many incorrect assignments in the CSD.
Therefore, their model could be used to flag potential mistakes in
a large database such as the CSD with more than a million entries.

A complementary and effective approach is to perform quality control
of data at the early stage of data generation. Often the production
of big data involves large-scale execution of computational or experimental
workflows, for which we would like to have careful control over the
quality of data generation as well as resource management to avoid
spending valuable resources on fallacious results. Indeed, manual
inspection is intractable in these cases, and automation is needed.
Using machine learning to control the data generation process is a
promising choice in this area of research. An excellent example is
the control of time-consuming first-principles calculations on open-shell
transition metal complexes.[52] These calculations
can frequently fail; for instance, the structure might fall apart
during the geometry optimization. To aid automatic detection of these
cases, a machine learning classifier model was used to predict simulation
outcomes on the sole basis of the chemical composition. Moreover,
a complementary classifier model was used to monitor the trajectory/convergence
of the calculations, aborting those that had a high chance of failure.
Using such models for autonomous job control can avoid generating
data that later might be hard to classify into valid and invalid results,
enhancing the quality of data generation.
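A minimal sketch of such a job-control classifier is shown below; the features, labels, and abort threshold are illustrative placeholders rather than the models of ref (52):

```python
# Minimal sketch: flag likely-to-fail calculations before spending resources
# on them. Assumes a table of composition features with a binary "failed"
# label collected from past jobs; all data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))           # e.g., composition descriptors
failed = (features[:, 0] + rng.normal(size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(features, failed, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Abort or deprioritize jobs whose predicted failure probability is high.
p_fail = clf.predict_proba(X_test)[:, 1]
flagged = p_fail > 0.8
print(f"test accuracy: {clf.score(X_test, y_test):.2f}, jobs flagged: {flagged.sum()}")
```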
From Structure to Properties
The next step is to be
able to predict the properties of a structure reliably and with sufficient
accuracy. In principle, we can use molecular simulation and quantum
mechanics to predict material properties. However, these techniques
are subject to an accuracy–efficiency trade-off and might become
computationally prohibitive depending on the system size, the time
scales of the physical phenomena of interest, and the number of systems
to be investigated. Moreover, some properties like synthesizability
are so complex that we still do not have methods to predict them using
computer simulations. Machine learning methods hold the promise to
shift the paradigm of accuracy and efficiency, enabling exploration
of large databases with high accuracy. Broadly speaking, machine learning
is used in two main ways in this context: to directly map structures
to their properties or to facilitate the development of new modeling
methodologies.

Indeed, in recent years several materials properties
have been predicted by machine learning methods. Examples include
gas adsorption,[41,53−56] catalytic,[42,57−59] thermal,[60,61] thermoelectric,[62,63] bulk mechanical,[64−68] and optical and electronic[67,69,70] properties.

In principle, not solving the complex equations
and inferring solutions
only from observing many examples allows us to tackle problems that
even state-of-the-art theory struggles to answer. In particular,
finding solutions for fuzzy problems such as materials synthesis,
synthesizability, and oxidation state, for which we do not have a
reliable theory, is an area of research in which data-driven methods
can play a significant role. Here, machine learning gives us extremely
flexible and elaborate empirical models that can fit the knowledge
of individuals or experimental observations and turn them into powerful
tools. Interestingly, this flexibility does not necessarily mean that
we cannot extract physical insights from these models; it is used
only to circumvent the limitations of conventional analytical equations
that sometimes are not complex enough to fully capture the behavior
of chemical systems. For example, in the case of oxidation states,
empirical models that use pairwise distances between atoms to describe
local geometries (e.g., the bond valence sum) are not sufficiently
elaborate to capture subtle geometric dissimilarities.[51] Moreover, using machine learning can even help
us to develop new theories and extract physical insights from the
model.[71] For example, Cranmer et al.[72] proposed an approach with which symbolic equations
can be derived from a neural network. They used this technique to
find a new equation that describes the concentration of dark matter,
but one can envision that a similar approach could reveal design rules
for materials.

One important area of research for machine learning
is to formulate
new modeling methods for quantum and statistical mechanics problems.
Machine learning approaches for molecular simulation are emerging
to solve complex and time-consuming calculations that we typically
encounter in modeling of chemical systems. These methods have already
had a significant impact on the way that we compute configuration
energies and forces and simulate thermodynamic,[73,74] kinetic,[75] electronic,[76] and excited-state[77,78] properties and phenomena.

One of the most significant and earliest applications of machine
learning in this area is the development of high-dimensional neural
network potentials to extract the potential energy surface of chemical
systems from quantum mechanical calculations.[79−81] The underlying
assumption here is that the potential energy can be decomposed into
a sum of contributions of local environments. Hence, a machine learning
model that is trained to map these local atom-centered environments
to an energy can be used as a “force field” for simulating
chemical systems. Behler and Parrinello[79] introduced a symmetry function formalism that is by design differentiable
and encodes the required physical invariances, i.e., the energy of
a system is invariant with respect to translation, rotation, and permutation
of atoms. Another seminal approach is the Gaussian Approximation Potential
(GAP) formalism based on the smooth overlap of atomic positions (SOAP)
representation of an atomic environment.[82,83] These potentials provide quantum-mechanical accuracy with the cost
of analytical force fields, allowing accurate simulation of large
systems on long time scales. Recently, attempts to extend them to
different elements of the periodic table have been carried out,[84] and several classes of materials have been successfully
modeled using these frameworks.[61,85−87]
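To make the energy decomposition concrete, the following minimal PyTorch sketch maps precomputed invariant descriptors of local atomic environments to atomic energies and sums them; the descriptor computation, layer sizes, and data are illustrative placeholders:

```python
# Minimal sketch of the high-dimensional neural network potential idea:
# the total energy is a sum of atomic contributions predicted from
# precomputed, invariant local descriptors (e.g., symmetry functions).
import torch

class AtomicNet(torch.nn.Module):
    def __init__(self, n_desc: int):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(n_desc, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 1),
        )

    def forward(self, desc):              # desc: (n_atoms, n_desc), one structure
        atomic_energies = self.mlp(desc)  # one energy per local environment
        return atomic_energies.sum()      # permutation-invariant total energy

net = AtomicNet(n_desc=8)
desc = torch.randn(20, 8, requires_grad=True)  # placeholder descriptors
energy = net(desc)
# Forces follow from the gradient of the energy with respect to coordinates;
# here we differentiate with respect to the descriptors as a stand-in.
(denergy_ddesc,) = torch.autograd.grad(energy, desc)
print(energy.item(), denergy_ddesc.shape)
```

One of the main bottlenecks in statistical mechanics is the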
simulation
of rare events: events that take place on a time scale that is short
for experiments but is extremely long for a simulation. To simulate
rare events in complex systems, which possess potential energy surfaces
with multiple minima separated by large energy barriers, it is a challenge
to adequately sample the configuration space to reach good statistics.
This is the case because the simulation might get trapped in metastable
states. For this reason, simulation of these systems requires advanced
sampling techniques such as umbrella sampling or replica exchange,[88−91] which try to push the system to move from one minimum configuration
to another. In a remarkable recent development,[92] a machine learning model, i.e., an invertible neural network
model, was used to map the complex and hard-to-sample configurational
space of a chemical system to a distribution that is easy to sample
(Figure 4). Such a
machine learning model can generate unbiased equilibrium samples,
following the Boltzmann distribution, in one shot. These machine learning
models, which were named Boltzmann generators, are illustrative examples
of the kind of new science that we can do using machine
learning that we could not do otherwise. They are conceptually different
from other established enhanced sampling techniques in that they do
not use any collective variable.
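The core of the method can be written compactly. As a sketch, for an invertible map x = f(z) with z drawn from the simple prior p(z), the density of the generated samples and the reweighting to the Boltzmann distribution (with reduced potential energy u(x)) read

```latex
p(x) = p\big(f^{-1}(x)\big)\,
       \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|,
\qquad
w(x) \propto \frac{e^{-u(x)}}{p(x)},
\qquad
\langle A \rangle \approx \frac{\sum_i w(x_i)\, A(x_i)}{\sum_i w(x_i)}
```

so that averages over the reweighted samples are unbiased estimates of equilibrium expectation values.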
Figure 4
Boltzmann generators. An invertible neural
network is used to generate
independent samples that follow the desired Boltzmann distribution
of a molecular system. First, a sample point is chosen from a simple
distribution p(z), e.g., a Gaussian distribution. Then the neural network
transforms this sample to a configuration x that follows p(x), which is
a Boltzmann distribution similar to the one of the system. Lastly,
to compute the thermodynamic properties, the samples are reweighted
to their Boltzmann weight. The figure was redrawn based on refs (92) and (93).
From Materials Properties to Performance and Application
Even if we know all of the thermodynamic and transport properties
of all of our materials, we still need to understand the techno-economic
and engineering requirements of the application in order to develop
performance metrics to objectively rank materials.[94] While this step crucially impacts our materials design
strategy, it is so challenging that we often avoid confronting it.
In particular, if we are carrying out research on novel materials,
a complete techno-economic assessment will be nearly impossible. For example,
in many applications the costs will be an important factor.[95] However, how can we estimate the cost of a material
that has not yet been synthesized? In the case of MOFs, the abundance
of the metal and the complexity of the ligand can be good indications.
However, one also has to factor in whether the synthesis can be scaled
up easily.[96] Moreover, the engineering
design might be totally constrained by nonscientific factors. For
example, the adsorption pressure for the vehicular natural gas storage
application was set to 65 bar by the Advanced Research Projects Agency–Energy
(ARPA-E), such that the process could be executed at home, while the
minimum discharge pressure was set to 5.8 bar.[97] If one selected a lower minimum discharge pressure,
materials with stronger adsorption sites for methane would become
more favorable. As a consequence, if these metrics are not well-defined
by external agencies, the metrics often become subjective and controversial;
each material can be shown to be exceptionally good for one particular
property. Therefore, only if we have an understanding of the relative
importance of all properties in the context of the full engineering
design of an application can we realistically evaluate whether a material
will make a real impact. We also need to keep in mind that such metrics
might give us the illusion that optimization of only one property
will lead us to breakthrough materials. However, because of the complexity
of the real-world application and the multistage design process, this
is usually not the case.

One step toward unraveling this complexity
is to establish an understanding of how materials properties influence
the performance in an industrial process. For example, the overwhelming
complexity of the coupled ordinary/partial differential
equations (ODEs/PDEs) underpinning mass and energy balances[98,99] often causes process modeling and optimization to be treated as a black
box. Using machine learning, we might be able to shine some light
on how systems operate. Despite its significance, this topic has not
been widely explored to date.[100] In one
recent exceptional example, the effect of adsorbent properties on
the carbon capture performance was analyzed by Burns et al.[101] Interestingly, they found that the common shortcut
metrics for evaluating materials are insufficient to predict their
process-level performance.

Besides,
measuring or computing the performance metric can become
a bottleneck in the case of complex processes and applications. An
illustrative example is the lifetime estimation of battery cells.[102] The typical lifetime of lithium iron phosphate/graphite
cells varies over the range of 150 to 2300 cycles (Figure 5). However, since
battery capacity degradation is a nonlinear process, it
is challenging to predict the cycle life from early cycles. For instance,
the capacity increases after 100 cycles for more than 81% of cells
(see Figure 5a). Therefore,
one needs to perform long cycle experiments, which often take months
to years to execute. Previously, voltage curves were used for degradation
diagnosis.[103] Hence, a machine learning
model that monitors the voltage curves from early cycles was developed
that can accurately (<4.9% test error) classify cells into long
and short cycle life using only the first five cycles (Figure 5b–d). By aborting the
long experiments of often hundreds or thousands of cycles for batteries
that are not promising, the authors could save huge experimentation
costs and time, allowing screening of a large number of candidates.
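A minimal sketch of such an early-cycle classifier is given below; the single log-variance feature is a synthetic stand-in for features derived from the Q100 – Q10 discharge curves, not the published model:

```python
# Minimal sketch: classify cells into long/short cycle life from an
# early-cycle feature. `log_var` stands in for the log variance of the
# Q100 - Q10 discharge curve difference; all data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
log_var = rng.normal(-4, 1, size=200).reshape(-1, 1)  # toy feature values
long_life = (log_var[:, 0] < -4).astype(int)          # toy label rule

X_train, X_test, y_train, y_test = train_test_split(log_var, long_life, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")

# In practice the label comes from the measured cycle life, and unpromising
# cells can be taken off the cycler after a few cycles instead of thousands.
```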
Figure 5
Prediction
of battery life cycle from early stages. (a) The cycle
life is shown with respect to cell capacity at cycle 100. (b, c) Characteristics
of the voltage curves of the first cycles were used as features to
develop the machine learning models. Q100 – Q10 is the change in discharge
capacity between cycles 10 and 100. (d) Predictions of the machine
learning model for two test sets. The secondary set was generated
after model development. The vertical dashed line shows the 100th
cycle, where the predictions were made. The figure was redrawn based
on data from ref (102).
Exploring the Design Space
The final step is to explore
the chemical space to find the best-performing candidates. We know
that it is not feasible to exhaustively search the chemical space
simply because of the exploding number of possible structures. For
example, the number of theoretically feasible small drug molecules
was estimated to exceed 10⁶⁰.[104] Ultimately, screening only the known materials or hypothetical structures
is not a solution, as these approaches cover only a limited part of
the chemical space and specifically can be biased because of human
choices or algorithmic limitations in structure generation.[14,41] Therefore, other search methods are desired to efficiently explore
the enormous chemical space.[11,105−107] Crucial in these algorithmic searches is the need to balance between
exploration, the process of probing the unseen regions of search space,
and exploitation, the process of probing the promising regions.

A very popular class of discrete optimization methods is that of
evolutionary algorithms, in particular genetic algorithms (GAs). These
methods explore the space by evolving a population of structures through
a set of iterative nature-inspired operations to optimize an objective
(fitness) function (Figure 6a). Since the operations can be tailored and guided by chemical
rules, these algorithms are a popular choice for chemical design.[108] The idea is that samples with higher fitness scores
have a better chance of survival and are selected more often to pass
their genes to new samples. Mutation and crossover of genes,
which could be functional groups of a ligand, control the balance between
exploration and exploitation in the search. A high mutation rate enables searching
of unexplored regions, while more crossover ensures local searching.
Machine learning can be used to quickly evaluate the fitness of generated
samples, accelerating the search for materials discovery. Coupling
GAs with machine learning has been successfully used for materials
synthesis,[109] the discovery of transition metal
complexes,[105] and the design of organic molecules.[110] In addition, active learning approaches, which
use uncertainty estimation in machine learning predictions, allow
exploration of regions of space that were not in the training set
by adding new data points to the training set on the fly when the
model is uncertain.
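The loop below is a minimal, self-contained sketch of a GA driven by a surrogate fitness function; the bit-string encoding and the toy fitness are placeholders for a real chemical representation and a trained property model:

```python
# Minimal sketch of a genetic algorithm with an ML surrogate as the fitness
# function. Individuals are bit strings standing in for, e.g., functional-
# group choices; `surrogate` is a placeholder for a trained property model.
import random

random.seed(0)
N_GENES, POP, GENERATIONS = 16, 30, 20

def surrogate(ind):                 # stand-in for an ML property prediction
    return sum(ind)                 # toy fitness: maximize the number of 1s

def mutate(ind, rate=0.05):
    return [1 - g if random.random() < rate else g for g in ind]

def crossover(a, b):
    cut = random.randrange(1, N_GENES)
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # selection: fitter individuals are more likely to pass on their genes
    population.sort(key=surrogate, reverse=True)
    parents = population[: POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print("best fitness:", surrogate(max(population, key=surrogate)))
```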
Figure 6
Methods for exploring chemical space. (a) Genetic algorithms
use
genetic operations to generate new samples that can quickly be evaluated
by a machine learning model to maximize the fitness score. (b) Variational
autoencoders (VAEs) learn a continuous lower-dimensional representation
(the latent space) that can be used for gradient-based optimization
of properties and recover the optimal chemicals by decoder. (c) Reinforcement-learning-based
approach that incorporates Monte Carlo tree search (MCTS) to complete
SMILES strings to generate new molecules, maximizing a reward function.
(d) In a generative adversarial model, the generator and discriminator
compete until the discriminator cannot distinguish generated samples
from real empirical samples. By generating new samples, one can explore
chemical space to maximize the properties of interest. The figure
was redrawn based on ref (11).

Alternatively, one can use machine
learning methods for the generation
of structures.[111] In particular, for
organic molecules represented as Simplified
Molecular Input Line Entry System (SMILES) strings, which follow basic valence rules, recurrent neural
networks (RNNs) and transformers are powerful at completing or generating
new strings. RNNs and transformers have been developed
to treat data sequences such as data in natural language processing
or voice recognition. To guide and control the generation toward the
properties of interest, one powerful approach is Monte Carlo tree
search (MCTS). MCTS is used in reinforcement learning tasks, which involve
real-time decision-making for the next moves, e.g., in playing games[112] or control, with a large, complex, and open-ended
solution space. In analogy, we can think of the completion of a SMILES
string as an open-ended process with a target: we win if the properties
of interest improve (see Figure 6c). This approach has been found to be effective for
exploring chemical space for different applications, such as MOFs
for gas adsorption,[113] synthesis planning,[114,115] and the design of drug molecules.[116]

A desirable way to circumvent the expensive optimization in
discrete, enormous chemical space is to develop continuous and differentiable
representations of chemical structures. If we couple these continuous
representations with a generative model that converts a point in the
continuous space to a chemical structure, we can perform direct gradient-based
optimization of properties. Variational autoencoders (VAEs) are machine
learning models that try to do this by learning a lower-dimensional
representation of the data that is sufficient to regenerate the original
data (Figure 6b). The
chief component of a VAE is the lower-dimensional representation,
which is called the latent space. By mapping data points to their
probability distribution functions in the latent space, we can reach
the continuous representation of chemical structures. This approach
has recently been applied to organic molecules,[117,118] small molecular graphs,[119] solid-state
materials,[120] and porous materials.[121]

An alternative method is generative adversarial
networks (GANs).
One neural network (the generator) generates new samples, and another
one (the discriminator) tries to classify the generated data and some
training data as fake or real (Figure d). The generator and the discriminator compete until
the generator is so good that the discriminator does no better
than chance (50%) at distinguishing fake from real. GANs are finding
their position in molecular and materials design,[111] as exemplified by the generation of molecular graphs in
MolGAN[106] and an energy grid of guest molecules
and zeolite structures in ZeoGAN.[122]
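Stripped of the molecular details, one GAN training step reduces to the sketch below (PyTorch); the vector-valued samples and network sizes are illustrative and not the MolGAN or ZeoGAN architectures:

```python
# Minimal sketch of one GAN training step on vector-valued representations
# (e.g., flattened energy grids); shapes and data are placeholders.
import torch

DIM, NOISE = 64, 16
G = torch.nn.Sequential(torch.nn.Linear(NOISE, 128), torch.nn.ReLU(),
                        torch.nn.Linear(128, DIM))
D = torch.nn.Sequential(torch.nn.Linear(DIM, 128), torch.nn.ReLU(),
                        torch.nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = torch.nn.BCEWithLogitsLoss()

real = torch.randn(32, DIM)                      # placeholder training batch
fake = G(torch.randn(32, NOISE))

# discriminator: label real samples 1 and generated samples 0
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# generator: fool the discriminator into predicting 1 for generated samples
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```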
Synthesis and Autonomous Experimentation
To practically
realize the aim of materials development, we need to be able to synthesize
the promising candidates discovered in the previous steps. However,
chemical synthesis is a complex, fuzzy process, and our theories are
still too limited to guide us through it. Therefore, chemical synthesis
mainly rests on unwritten heuristic rules that experienced chemists
gain over the course of many experiments. Data-driven approaches are
promising alternatives for inferring such chemical intuition if they
are presented with a sufficient number of failed and successful experiments.
This concept was demonstrated for the synthesis of organic,[123,124] inorganic,[125−128] and MOF[109] materials. Notably, these
data-driven methods have the advantage that their predictive performance
improves as more data are provided.

Coupling artificial intelligence
with robotic synthesis platforms has taken the idea of autonomous
laboratories and experimentation far.[14,129−131] The process of finding, optimizing, and executing synthesis is not
only tedious and resource-intensive but also prone to bias and error.
One can instead use robotic platforms to reduce the synthesis costs
and errors simultaneously, while using artificial intelligence to
control the robots. This approach has attracted tremendous attention
recently and has led to the development of software like ChemOS[132,133] and hardware like the Chemputer[134] to
perform experiments. Various methods have been used to guide these
robots, from conventional farthest-point sampling and genetic algorithms[109] to Bayesian optimization,[135−137] again trying to balance exploration and exploitation of chemical
synthesis space. The recent work of Burger et al.[137] introducing a mobile robotic chemist (Figure ) demonstrates how fast this
topic of research is growing, and it will be interesting to see whether
it leads to the discovery of novel chemistries.
Figure 7
A mobile robotic chemist.
The robot was used to perform an autonomous
search to find a photocatalyst for hydrogen production from water.
The robot improved the photocatalytic activity of the initial formulations
(indicated by the baseline) by a factor of 6 over 8 days of searching
the experimental space, performing 688 experiments. The photograph
of the robot was provided by Andrew I. Cooper and Benjamin Burger
(University of Liverpool). The figure was redrawn based on ref (137).

Another promising application of machine learning is in synthesis
planning. The challenge here is to identify a feasible route (i.e.,
the reaction steps, conditions, and reactants) for the synthesis of
chemical compounds starting from available chemicals. Ideally, we
want a program that takes a target structure as input and provides
a list of detailed feasible reaction steps, simultaneously minimizing
the number of steps and complexity of the process. Data-driven approaches
have recently been explored and shown to be promising for finding
the synthesis steps in organic retrosynthesis,[12,114,138,139] suggesting organic directing agents for the synthesis of zeolites,[140] and identifying new phases of inorganic compounds.[127]
Challenges and Opportunities
Data
An Ecosystem with Standards
The most elementary part
of machine learning studies is data. The basic standards
for data management have been explained in terms of the findable,
accessible, interoperable, and reusable (FAIR) guiding principles.[141] In essence, the FAIR principles require (meta)data
to be openly retrievable by a unique global and persistent identifier
as well as provided with the usage license and detailed provenance.
To fully meet the FAIR principles, we must not only develop and use
standardized ways of reporting data but also provide ways to access
the tools, protocols, codes, and input parameters so that the data
can be reproduced. Consequently, developing user-friendly
ecosystems that encourage sharing and programmatic access to FAIR data is
a fundamental step and a key challenge to unlock the true power of
data-driven approaches in the chemical sciences. In addition, it is
now common that funding agencies ask for data management plans and
require that all data be made publicly available. However, systematically
doing these tasks requires a complete rethinking of the way we do
research, in which reproducibility and data sharing are the starting
point rather than an afterthought to meet the requirements of a journal
or funding agency.
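As a purely illustrative sketch, a FAIR-style metadata record might carry fields along the following lines; the schema is hypothetical, not a community standard:

```python
# Hypothetical sketch of a FAIR-style metadata record for one dataset entry;
# every field name and value below is illustrative.
record = {
    "identifier": "doi:10.xxxx/example",       # globally unique and persistent
    "license": "CC-BY-4.0",                    # explicit usage license
    "provenance": {
        "code": "simulation-package vX.Y",     # tools and versions used
        "inputs": "inputs.tar.gz",             # everything needed to reproduce
        "protocol": "relax-then-adsorb workflow",
    },
    "data": "results.csv",
}
```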
Collection of Experimental Data
While machine learning
using failed experiments can be expected to be one of the most important
areas in chemistry,[109,128,142] a large body of these failed experiments remains unreported. It
is too demanding to expect researchers to spend a considerable fraction
of their time on documenting failed experiments. Instead, since data
are routinely generated over the course of a research project, solutions
that are fully integrated with experimental instruments are needed
in order to collect data while the user is performing the experiments.
Such platforms have remained underdeveloped in chemical sciences.
For example, electronic lab notebooks (ELNs)[143−145] allow sharing of protocols, postprocessing scripts, and measurement
techniques in a collaborative fashion as well as real-time data acquisition.
More importantly, ELNs can allow all of the data (failed and successful)
to be published in standard formats with little or no additional effort
on the part of the researcher. However, it is essential for the chemistry
community to embrace the development of such an open science infrastructure.
Reproducibility
Anyone who has tried to reproduce results
from the literature can testify that in many cases the articles do
not provide all of the information needed to reproduce the results.
In the case of computational results, for example, there are often
unreported parameters (e.g., default parameters in a code), and the
reader of the article may be unaware of their importance. However,
if these parameters change over time or different ones are used in
different groups, it becomes impossible to reproduce the results.
The simplest solution is to publish all input files and all scripts
along with the article. However, managing this for large-scale calculations
using multiple codes becomes intractable, and therefore, one needs
a special infrastructure to be able to do this systematically. Recent
developments of infrastructure in this area, such as Materials Cloud,
AiiDA,[35,146−148] and FireWorks,[149] are opening promising
paths toward addressing these issues. Automation and workflow development
and execution tools for machine learning in materials science are
also under development, e.g., ChemML[150] and Automatminer.[151] Creating, maintaining,
and encouraging the use of these open science infrastructures require
the support of the computational chemistry community.
Data Curation
As important as they are, data management
and curation are among the least enjoyable, time-consuming, tedious,
and error-prone tasks. Specifically, since we deal with a large number
of data points, e.g., a large number of structures in databases like
the CSD, manual inspection is out of the question, and the development
of new methods is inevitable. Exploring machine learning methods for
improving or even building new innovative ways of data curation is
an opportunity for future research in chemistry and materials science.
Such methods for automatic data curation have recently received attention
in many disciplines, including the chemical sciences.[152−155] For instance, by coupling uncertainty estimation methods exploiting
the statistical nature of machine learning methods, one can identify
mistakes and anomalies in big data. For example, in cases where the
machine learning model is confident in its predictions but large discrepancies
are observed with the reported data, the user can be warned to double-check
the entry to avoid mistakes in databases.
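A minimal sketch of this flagging logic follows; the data are synthetic, and in practice X would hold structure descriptors and y the reported property values (e.g., oxidation states):

```python
# Minimal sketch: flag database entries where a cross-validated ensemble
# is confident yet disagrees strongly with the reported value.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                    # stand-in structure descriptors
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)
y[::50] += 5.0                                   # inject some "wrong" entries

pred = np.empty_like(y)
spread = np.empty_like(y)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])                # checked entries are held out
    per_tree = np.stack([t.predict(X[test]) for t in model.estimators_])
    pred[test], spread[test] = per_tree.mean(axis=0), per_tree.std(axis=0)

# confident model (small ensemble spread) + large discrepancy -> flag entry
suspicious = (spread < 1.0) & (np.abs(pred - y) > 2.0)
print("entries to double-check:", np.flatnonzero(suspicious))
```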
Data in the Literature
There is a large body of data
stored in the literature. Natural language processing for extracting
data from text, together with image- and sequence-processing techniques for analyzing
spectra, is potentially interesting here. Unfortunately, a major obstacle
to overcome is converting Portable Document Format (PDF) files to compatible
formats (e.g., plain text). In the future, it might be beneficial
for the scientific community to consider reporting in formats
that are better suited to machine interpretation.
Bias and Uncertainty in the Design Process
Novelty, Bias, and Diversity
Most scientific efforts
have been focused on incremental improvements of some shortcut performance
indicators, for example, the adsorption capacity and selectivity of
MOFs for carbon capture. However, if we consider the full scope of
the design process, we realize that such materials properties are
only inputs for the next stage of the design (see Figure 1). Therefore, the approach
based on incremental improvement of properties is not only limited
in finding the true optimal solutions for the full design process
but also introduces a strong bias by providing only limited options
for the next stage of the design process. For example, for most real-world
applications we need a trade-off between multiple properties, and
the optimization of only one objective will exclude many solutions
that might perform much better in the real problem. If we now also
consider that the properties we optimize are not necessarily good
surrogates for the practical application, we realize that focusing
on the optimization of these metrics will limit our ability to discover
novel materials, for which the application might be based on a mechanism
of which we are not yet even aware. For this reason, we argue that
for a broader perspective over the materials design process, enhancing
novelty in each stage will be a better path to success than the optimization
of single metrics. Essential here is the development of metrics that
allow measurement of such novelty in the evaluation of scientific
discoveries. Careful quantification of diversity by extending and
developing new metrics in all stages of materials design can help
us to reduce such bias.[41,156]
Uncertainty Quantification and Error Propagation
Since
we are not using physical laws in machine learning models, it is crucially
important to be able to identify the domain of applicability of the
models for predictions of new systems. However, quantifying uncertainty
can be challenging and costly, and this topic has only recently received
some attention in the chemical sciences. Several methods have been
proposed for quantifying uncertainty,[51,157,158] such as measuring the distance of a new sample to
the training data or using ensemble models. Further studies are needed
to provide an understanding of the limitations of current methods,
to develop more reliable and cheap methods, and to provide guidelines
on choosing the method for quantification of uncertainty.

In
addition, the error we make in a design process is not limited only
to machine learning predictions, as any simulation or experiment also
has a level of accuracy. It is therefore important to know how errors
propagate through the entire design process. Statistical methods can
be used to analyze the sensitivity of the outcomes to the inputs,
providing insight into the reliability and relevance of the entire process.
Structure–Property–Performance
Featurization
Further developments are required to
apply machine learning to those materials properties that require
a tensorial representation, are highly dependent on long-range interactions,
or involve dynamics. For example, we are still limited in featurization
of materials for those material properties that require tensorial
representation, including the stiffness tensor for mechanical properties,
the heat conduction tensor for heat transport, and the susceptibility
tensor for magnetic/electronic properties. In addition, current representations
are limited for properties that rely on structural dynamics. For example,
we are aware of the role of flexibility in the adsorption properties of
soft porous materials (e.g., in MOFs), yet the commonly used representations
do not capture these subtleties.

Additional developments are
needed for generative models. For example, sequence-based generative
models based on SMILES strings, which have been the main method for
generative design of chemicals, cannot generalize to chemistries that
do not follow valence-based rules.[11] Also,
using generative models with SMILES strings can cause problems
since many SMILES strings do not correspond to valid molecules. For
this reason, novel representations that are based on a formal grammar
have been developed.[159] We note that graph-theoretical
descriptors ignore any information related to geometry. Therefore,
for any materials and molecular properties that are sensitive to the
details of atomic coordinates and geometry, current generative models
are limited.
Molecular Simulation
The different
branches of machine
learning techniques for molecular simulation have advanced independently.
Examples of these techniques include “machine learned”
potentials,[79,83] enhanced sampling methods such
as Boltzmann generators,[92] and methods
for analysis of molecular dynamics trajectories.[160] The next step is to merge these methods into a toolbox
that can be used for different systems at scale. Since these methods
work hand in hand with conventional quantum and classical molecular
simulation methods, it is of great value to implement and couple them
in the existing simulation packages.[161,162]

The
development of new modeling techniques will remain fundamentally important
for the future of the application of machine learning methods. One
of the main pillars of the fast development of data-driven methods
in recent years has been the abundance of data, mainly simulated big
data due to growing computational power. Hence, it will be continuously
important to improve the simulation methods and their accuracy, especially
for challenging problems such as nonlinear and noncontinuous phenomena
(e.g., instability and regime change), where we still rely heavily
on simulation.
Modeling Complex and Dynamic Processes
An interesting field of research that has barely been explored for process modeling is the use of machine learning to efficiently solve (nonlinear) partial differential equations.[163,164] These methods have shown great performance in solving complex Navier–Stokes equations in fluid mechanics, e.g., for turbulence applications.[165] Adapting them for process modeling would not only reduce the computational cost but also allow additional levels of complexity to be included in the models, such as nonideal effects that are often neglected.
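As a sketch of the underlying idea, a physics-informed neural network penalizes the residual of the governing equation at sampled collocation points, so that the trained network itself becomes the solution. The minimal PyTorch example below targets the 1D diffusion equation u_t = D u_xx; the architecture, the sampling, and the omission of boundary terms are illustrative simplifications, not a recommended solver.

```python
# Minimal sketch of a physics-informed neural network (PINN) for the 1D
# diffusion equation u_t = D * u_xx on the unit square in (x, t).
# Boundary conditions are omitted for brevity; illustrative only.
import torch

D = 0.1
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # Random collocation points (x, t) where the PDE residual is enforced.
    xt = torch.rand(256, 2, requires_grad=True)
    u = net(xt)
    grads = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = grads[:, :1], grads[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x),
                               create_graph=True)[0][:, :1]
    pde_loss = ((u_t - D * u_xx) ** 2).mean()

    # Initial condition u(x, 0) = sin(pi x), enforced at t = 0.
    x0 = torch.rand(128, 1)
    u0 = net(torch.cat([x0, torch.zeros_like(x0)], dim=1))
    ic_loss = ((u0 - torch.sin(torch.pi * x0)) ** 2).mean()

    loss = pde_loss + ic_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```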
Making Machine Learning Comparable
The field of machine learning for the design of materials would strongly benefit from established reporting standards and benchmark sets for model comparison and evaluation. The successful path paved by researchers in the field of small organic molecules teaches us that benchmark data sets of molecules and their corresponding properties (labels) enabled fast progress by allowing new studies to build on previous ones. Without a reference benchmark set of materials and labels, one cannot compare the performance of different featurizations and model architectures, as differences might originate from inhomogeneities in the data, such as differences in computational methodology, or from the underlying distribution of the structural databases and their lack of diversity. Hence, benchmark materials sets with consistently computed properties need to be developed. Furthermore, since comparing models is not trivial, agreement on standard reporting methods is needed; a sketch of a like-for-like comparison protocol is given below. A valuable step in this direction was taken by Wang et al.,[166] who provided best-practice guidelines for machine learning for materials scientists.
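Until community benchmarks are established, a minimum requirement is that alternatives be compared on identical data splits with identical metrics. The sketch below illustrates this for two hypothetical featurization functions (`featurize_a`, `featurize_b`) using scikit-learn; fixing the cross-validation seed guarantees that every featurization sees exactly the same train/test partitions.

```python
# Minimal sketch: comparing featurizations on identical splits so that
# differences reflect the representation, not the data. `featurize` is a
# hypothetical function mapping a structure to a feature vector.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def evaluate(featurize, structures, y, seed=0):
    X = np.array([featurize(s) for s in structures])
    y = np.asarray(y)
    maes = []
    # The fixed KFold seed yields identical splits for every featurization.
    for train, test in KFold(5, shuffle=True, random_state=seed).split(X):
        model = RandomForestRegressor(random_state=seed)
        model.fit(X[train], y[train])
        maes.append(mean_absolute_error(y[test], model.predict(X[test])))
    return np.mean(maes), np.std(maes)

# evaluate(featurize_a, ...) and evaluate(featurize_b, ...) with the same
# seed then yield directly comparable mean absolute errors.
```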
New Learning Algorithms
Exploring state-of-the-art machine learning methods and extending them to the chemical sciences is a significant opportunity for future research. In particular, methods such as transfer learning, multitask learning, and one-shot learning, which facilitate learning by transferring parameters or features and/or sharing contextual information, are attractive for cases in which we have little data for a given materials class.
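As an illustration of the first of these, a transfer-learning workflow typically reuses a feature extractor trained on a data-rich task and fine-tunes only a small head on the data-poor one. The PyTorch sketch below assumes a hypothetical `pretrained` sequential network whose final layer is linear; this is one common recipe, not the only one.

```python
# Minimal sketch of transfer learning: freeze the pretrained feature
# extractor and train only a fresh head on the small target data set.
import torch

def make_finetune_model(pretrained: torch.nn.Sequential, n_out: int = 1):
    # Freeze every pretrained parameter: these carry the transferred features.
    for p in pretrained.parameters():
        p.requires_grad = False
    # Swap the final (assumed linear) layer for a fresh, trainable head.
    feature_dim = pretrained[-1].in_features
    head = torch.nn.Linear(feature_dim, n_out)
    return torch.nn.Sequential(*list(pretrained[:-1]), head)

# Fine-tune on the small data set by optimizing only the new head:
# model = make_finetune_model(pretrained)
# opt = torch.optim.Adam(model[-1].parameters(), lr=1e-4)
```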
Causal and Interpretable Machine Learning
Explainable machine learning is opening a path toward fresh insights and novel theories. In contrast to the common perception of machine learning models as black boxes, interpreting explainable models can shed light on the connections between the underlying structure and the corresponding property and performance. Machine learning models can be seen as extremely flexible empirical models that, like conventional empirical models, can uncover profound new understanding and inspire new theories if interpreted correctly. However, one needs to be cautious not to fall into the trap of mistaking correlation for causation. For example, the number of sunburn cases in a city is correlated with the amount of ice cream sold there; both are obviously driven by a common cause, the dry, hot, sunny days of summer. Not all cases are that obvious, however, and further fundamental research is required to find methods that measure the trustworthiness of explanations.[167,168]

In particular, explainable machine learning methods can potentially change the way we study phenomena for which we still have limited theories. In a typical physical system, one can assume that a few important terms, such as the dimensionless numbers in fluid mechanics, govern the behavior of the system. Therefore, using machine learning and symbolic regression, one can try to extract the governing equations from large data sets,[71,72] as sketched below.
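A minimal sketch of this idea, in the spirit of sparse identification of nonlinear dynamics, fits a sparse linear model over a library of candidate terms so that only the terms actually appearing in the governing equation survive. The data below are synthetic, and the term library and penalty strength are illustrative choices.

```python
# Minimal sketch of extracting a governing equation by sparse regression
# over a library of candidate terms; here we recover dx/dt = -x + 0.5*x**3
# from noisy synthetic "measurements".
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
dxdt = -1.0 * x + 0.5 * x**3 + 0.01 * rng.normal(size=x.size)

# Candidate library [1, x, x^2, x^3]; the L1 penalty drives the
# coefficients of terms absent from the true equation toward zero.
library = np.column_stack([np.ones_like(x), x, x**2, x**3])
fit = Lasso(alpha=1e-3, fit_intercept=False).fit(library, dxdt)
print(dict(zip(["1", "x", "x^2", "x^3"], np.round(fit.coef_, 3))))
```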
Synthesizability
Perhaps the greatest challenge for computational and data-driven materials design and discovery is the synthesizability of the discovered structures. Thanks to the great progress in methods for inverse design, we can navigate chemical space to find optimal materials and molecules. However, the full use of this approach is hampered by our ignorance of the synthesizability of the discovered structures. Even if we restrict our search to theoretically valid structures, the discovered structures often appear impossible to synthesize. Therefore, developing general strategies for biasing the search toward the synthetically accessible parts of chemical space will be an important research direction (a simple heuristic filter is sketched below). An interesting solution is to design structural motifs that can be incorporated into chemically synthesizable structures, instead of designing the full structure. For example, instead of discovering a MOF that is optimal for CO2 capture in the presence of water, Boyd et al.[24] discovered a set of adsorption sites, named adsorbaphores. In the next step, they generated a new set of hypothetical structures that contain these adsorbaphores and are also water-repellent. In this step, however, they restricted their search to a set that experimentalists judged to be synthesizable. This supervised search relied on the intuition of expert chemists, and machine learning can help us infer and capture exactly this kind of intuition. Further research in this direction needs to explore the extent to which synthesizability can be encoded into computational and data-driven materials design.
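As a simple example of such a bias for molecular candidates, structures can be screened with a heuristic synthetic-accessibility score before any expensive evaluation. The sketch below uses the SA score shipped in RDKit's contrib directory; the cutoff of 4.0 is an arbitrary illustrative choice, not an established threshold.

```python
# Minimal sketch: biasing candidate selection with the SA score heuristic
# from RDKit's contrib directory (scores run from ~1, easy to synthesize,
# to ~10, very hard). The threshold below is an illustrative assumption.
import os
import sys
from rdkit import Chem, RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

def keep_accessible(smiles_list, threshold=4.0):
    kept = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and sascorer.calculateScore(mol) <= threshold:
            kept.append(smi)
    return kept
```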
Outlook
Machine
learning is transforming the way that we approach rational
materials design. The inherent complexity of searching the vast spaces
of options we face in the process of material design, from materials
to processes and applications, requires the development of methods
that work best in the limit of large numbers. Machine learning methods provide us with this toolbox. Using these methods, we can conceptualize a new way of approaching materials design. The remarkable advancements that we have reviewed in this Perspective serve as proofs of principle for the components of such an approach. By advancing and merging these components, we can fully exploit these advances and realize the power of data-assisted materials design. Indeed, there are still
significant challenges on the way, some of which have been mentioned
here. However, considering the fast progress in recent years, we can
envision that machine learning will be integrated into almost all
components of materials design and discovery in the near future.

A pillar of success for this approach is data. When
we rely on data to infer solutions to our problems, the generation of large-scale, accurate, and reproducible data is vital; we admit that this remains one of the grand challenges for the future of the field. Overcoming it requires new research cultures and collaboration among multiple disciplines of science and engineering, including both theoreticians and experimentalists. Only through an open, disciplined, and collaborative environment based on agreed data-reporting protocols will we be able to move fast and use the real power of data-driven methods for materials design.

Most of the discoveries in the history of science were not purely
rational but relied on the intuition of scientists. In this sense, scientists themselves can be seen as black boxes with bright intuition for decision-making. What is interesting about machine learning models is that, once we have a discovery or prediction, we can trace back the paths of decision-making to uncover new insights. We can therefore now focus on shaping the future of materials design using the opportunities that machine learning brings for doing better science.