Data-Driven Strategies for Accelerated Materials Design.

Robert Pollice^1,2, Gabriel Dos Passos Gomes^1,2, Matteo Aldeghi^1,2,3, Riley J Hickman^1,2, Mario Krenn^1,2,3, Cyrille Lavigne^1,2, Michael Lindner-D'Addario^1,2, AkshatKumar Nigam^1,2, Cher Tian Ser^1,2, Zhenpeng Yao^1,2, Alán Aspuru-Guzik^1,2,3,4.

Abstract

The ongoing revolution of the natural sciences by the advent of machine learning and artificial intelligence has sparked significant interest in the material science community in recent years. The intrinsically high dimensionality of the space of realizable materials makes traditional approaches ineffective for large-scale explorations. Modern data science and machine learning tools developed for increasingly complicated problems are an attractive alternative. An imminent climate catastrophe calls for a clean energy transformation that overhauls current technologies within the few years still available for action. Tackling this crisis requires the development of new materials at an unprecedented pace and scale. For example, organic photovoltaics have the potential to replace existing silicon-based materials to a large extent and open up new fields of application. In recent years, organic light-emitting diodes have emerged as state-of-the-art technology for digital screens and portable devices and are enabling new applications with flexible displays. Reticular frameworks allow the atom-precise synthesis of nanomaterials and promise to revolutionize the field through their potential to realize multifunctional nanoparticles with applications from gas storage, gas separation, and electrochemical energy storage to nanomedicine. In the past decade, significant advances in all these fields have been facilitated by the comprehensive application of simulation and machine learning for property prediction, property optimization, and chemical space exploration, enabled by considerable advances in computing power and algorithmic efficiency. In this Account, we review the most recent contributions of our group in this thriving field of machine learning for material science. We start with a summary of the most important material classes our group has been involved in, focusing on small molecules as organic electronic materials and crystalline materials.
Specifically, we highlight the data-driven approaches we employed to speed up discovery and derive material design strategies. Subsequently, our focus lies on the data-driven methodologies our group has developed and employed, elaborating on high-throughput virtual screening, inverse molecular design, Bayesian optimization, and supervised learning. We discuss the general ideas, their working principles, and their use cases with examples of successful implementations in data-driven material discovery and design efforts. Furthermore, we elaborate on potential pitfalls and remaining challenges of these methods. Finally, we provide a brief outlook for the field as we foresee increasing adaptation and implementation of large scale data-driven approaches in material discovery and design campaigns.

Year: 2021 PMID: 33528245 PMCID: PMC7893702 DOI: 10.1021/acs.accounts.0c00785

Source DB: PubMed Journal: Acc Chem Res ISSN: 0001-4842 Impact factor: 22.384

Key References

[1] Realization of an integrated inverse design workflow from high-throughput virtual screening to device testing for organic light-emitting diode materials.
[2] An automated nanoporous materials discovery platform powered by a supramolecular variational autoencoder was built and demonstrated for the efficient exploration of the near infinite reticular chemical space and inverse design of reticular materials with desired functions like gas separation.
[3] The proposal of a genetic algorithm enhanced by a neural network for inverse molecular design that can avoid convergence and bias molecule generation based on existing data sets.
[4] A probabilistic global optimization algorithm based on Bayesian kernel density estimation for the efficient parallel search of optimal experimental conditions.

Introduction

The tremendous rise of data science and machine learning (ML) in the last decades has led to the suggestion that it constitutes the fourth pillar of science.[5] While data has always been at the heart of research, current hardware enables its utilization at an unprecedented scale.[5] Accordingly, our group, the Matter Lab, has been using ML extensively to accelerate the discovery of new materials, especially for clean energy technologies to combat climate catastrophe and enable innovative technologies. In this Account, we define discovery as observing a previously unknown natural phenomenon or object,[6,7] and design as rationally devising an object based on a particular plan.[8] Typically, discovery precedes and inspires materials design, as design requires at least minimal knowledge of the necessary features. Therefore, large scale discovery helps to speed up the establishment of material design principles, i.e., heuristics to realize particular designs, because it enables identifying patterns in known matter with desired properties. In turn, successful design catalyzes the realization of new materials by restricting the search space to only the most promising regions in subsequent campaigns. Herein, we review our work on organic electronic materials, crystalline materials, and data-driven methodologies for materials discovery and design, particularly high-throughput virtual screening, supervised learning, inverse molecular design, and Bayesian optimization. Moreover, we formulate general strategies for data-driven materials design our lab has adopted over the years and show how to implement them using ML. Finally, investigating these approaches critically, we propose typical use cases and highlight unsolved challenges.

Applications

Organic Electronic Materials

One of our research foci has been organic electronic materials.[9] Compared to silicon-based electronics, they offer several advantages, including low cost, low density, high mechanical flexibility and toughness, low energy consumption, and easy processability. Further, chemical derivatization is well-established, making the accessible candidate space vast. Accordingly, solar cells have experienced a remarkable surge because of the vast energy available from the sun and increasing efforts against a climate catastrophe. Organic photovoltaics[10] (OPVs) could replace commercial silicon-based devices if their power conversion efficiencies (PCEs) surpassed 10% and their lifetimes exceeded several thousands of hours. Notably, state-of-the-art OPVs reach 18% PCE in laboratory devices.[11]

The Harvard Clean Energy Project (CEP) was initiated to find photoactive organic materials with high efficiencies.[12] Starting from 26 building blocks, selected based on expert knowledge to maximize performance and synthesizability,[13] 10^7 potential donors were generated. They were evaluated using high-throughput virtual screening (HTVS, vide infra) via increasingly expensive property predictions. First, the library was assessed using linear descriptor models constructed from experimental data. Subsequently, electronic structure calculations were performed, and PCEs were estimated using the Scharber model with a fullerene as acceptor.[14] That way, about 1000 candidates with estimated PCEs of 11% and higher were identified. Additionally, statistical analysis of the top-performing molecules revealed design principles for photoactive donors identifying building blocks more likely to exhibit high performance.
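As a rough illustration of the kind of estimate used in such a screen, the sketch below follows the commonly quoted form of the Scharber model: the open-circuit voltage is taken from the donor HOMO and acceptor LUMO energies minus an empirical 0.3 V loss, and the PCE follows from an assumed fill factor and short-circuit current. The function names, the default fill factor of 0.65, the PCBM-like acceptor LUMO of -4.3 eV, and the example inputs are assumptions for illustration and are not taken from the CEP workflow. In practice the short-circuit current would be obtained by integrating a solar spectrum above the donor band gap.

```python
def scharber_voc(homo_donor_ev, lumo_acceptor_ev=-4.3, loss_v=0.3):
    """Open-circuit voltage in V from frontier orbital energies in eV.

    The 0.3 V empirical loss and the -4.3 eV acceptor LUMO are assumed
    defaults for a fullerene acceptor, not values quoted in the text.
    """
    return abs(homo_donor_ev) - abs(lumo_acceptor_ev) - loss_v


def scharber_pce(homo_donor_ev, jsc_ma_cm2, fill_factor=0.65,
                 lumo_acceptor_ev=-4.3, p_in_mw_cm2=100.0):
    """Power conversion efficiency in percent.

    V * mA/cm^2 gives mW/cm^2, which is divided by the incident power.
    """
    voc = scharber_voc(homo_donor_ev, lumo_acceptor_ev)
    return 100.0 * voc * jsc_ma_cm2 * fill_factor / p_in_mw_cm2


# Hypothetical donor: HOMO at -5.4 eV, short-circuit current of 15 mA/cm^2.
example_voc = scharber_voc(-5.4)
example_pce = scharber_pce(-5.4, 15.0)
```

Because the estimate depends only on a few orbital energies, it is cheap enough to apply to every candidate that reaches the electronic structure stage of a screening funnel.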
Notably, the screening efforts led to the experimental characterization of an organic crystal with one of the highest hole mobilities reported at the time.[15] Subsequently, extending the CEP to nonfullerene acceptors, over 51 000 candidates were generated based on 107 expertly chosen fragments.[16] More sophisticated property calibration with Gaussian processes and a modified Scharber model improved PCE predictions with a well-studied electron donor. Overall, 838 molecules with predicted PCEs of 8% or larger were found. Moreover, statistical analysis of the candidate structures was performed with respect to both Morgan fingerprints and the building blocks, establishing a general architecture for nonfullerene acceptors.

Similarly, organic light-emitting diodes[17] (OLEDs) have found wide adoption in small displays, are becoming prevalent in screens and lighting applications, and are entering the market in flexible displays. Thermally activated delayed fluorescence (TADF) emitters have become the main OLED class because of their high quantum efficiency, operational stability, and low cost. Their essential property is a small energy gap between the first excited singlet and triplet states so that energetically favored but nonemissive triplet excitons can be upconverted to emissive singlet excitons. Based on knowledge about the TADF mechanism, our group carried out HTVS of emitters covering 10^6 candidates (Figure 1).[1] Key methodology included efficient quantum chemistry, calibrated against experiment via supervised learning (vide infra). Linear regression and neural networks were used for property predictions across the entire space.

Figure 1

Inverse design workflow for thermally activated delayed fluorescence organic emitters from selecting fragments to device integration and testing.

Exploration was performed iteratively using a neural network to predict the most promising candidates, which were then simulated, minimizing evaluations. Not only were known emitters rediscovered, but new structures were also uncovered. Additionally, the systematic exploration exposed both established property trade-offs and unknown property limits. Moreover, the best leads were evaluated by human experts concerning synthesizability and novelty. Consequently, the most promising molecules after both computer and human-based evaluations were synthesized and incorporated into devices, leading to high external quantum efficiencies of over 20%. This study serves as a prototype for the entire data-driven discovery pipeline from defining the candidate space to device integration.

Finally, renewable energy like wind and solar is intermittent, requiring large storage capacities to meet consumer demands. Redox-flow batteries (RFBs) address this by separating energy from power, enabling large grids to store immense amounts of energy scalable to varying demand loads.[18] Organic RFBs[19] (ORFBs) represent a sensible advancement, as redox-active organic electrolytes are tunable and cheaper than inorganic alternatives.[20] To identify ideal organic electrolytes, our group performed HTVS of quinones, which are well-known for their single-electron redox pairs.[21] The screening spanned 1710 single- and double-electron redox pairs to validate existing studies and find new redox couples. The results indicated that quinone-exclusive electrolytes were promising for aqueous ORFBs and revealed that functionalizations near the carbonyl groups largely affected redox potential and those away largely affected solubility.
Subsequently, several experimental studies verified these predictions.[22,23] However, decomposition was found to deteriorate battery capacity irreversibly.[24] Hence, our group performed combined computational and experimental studies on the decomposition of quinones in aqueous environments.[18] HTVS was performed for over 140 000 redox pairs, including decomposition product analysis. The results identified a trade-off between redox potential, with a maximum near 0.95 V, close to experimental results at 0.85 V,[25] and stability. These results provide roadmaps for future studies, which are ongoing in our group, as the trade-off suggests that electrolyte stability must be considered.

Crystalline Materials

Crystalline energy storage materials with high energy density at low cost are cornerstones of renewable energy applications. For instance, multivalent calcium ion batteries[26] (CIBs) improve upon monovalent lithium-ion counterparts through increased capacities and higher material abundance while maintaining comparable operating voltages.[27] However, the development of CIBs is hindered by the failure of traditional graphite and calcium metal anodes due to difficulties in intercalation and the lack of efficient electrolytes. Recently, a high voltage (4.45 V) CIB cell using tin as the anode was reported to achieve a remarkable cyclability (over 300 cycles).[28] Importantly, designing CIB anodes with improved performance requires a thorough exploration of the alloying space as calcium mixes with many elements. Hence, our group constructed a workflow to discover novel multivalent CIBs.[29] First, the tin electrochemical calciation reaction was investigated computationally and the reaction driving force as a function of calcium content was simulated. This exploration allowed the identification of threshold voltages governing the calciation limits. Consequently, a four-step screening strategy was adopted to look for high-performance CIB anodes. First, 357 metal–calcium binary and ternary compounds were identified from the Inorganic Crystal Structure Database (ICSD)[30] and further filtered to 115 candidates with existing decalciated metal/metalloid or binary intermetallic compounds. The calciation voltage profiles were calculated, and two threshold calciation voltages were defined, one stricter, based on the tin–calcium system, and the other more relaxed to account for potential differences in the driving force requirements. For each threshold, the maximum capacities, output voltages, volume expansions, and energy densities of the respective material were determined. 
Finally, metal–calcium systems with higher energy density than tin–calcium were identified; among them, metalloids (Si, As, Sb, Ge), post-transition metals (Al, Pb, Cu, Cd, CdCu2, Ga, Bi, In, Tl, Hg), and noble metals (Ag, Pt, Pd, Au) showed promise as alloying candidates for CIB anodes, which calls for further experimental validation.

Additionally, reticular frameworks[31] (RFs), which include metal–organic frameworks (MOFs), are crystalline porous materials with high internal surface area and high stability and can be used for gas storage, gas separation, and electrochemical energy storage. They are constructed via self-assembly of molecular building blocks and exhibit a near-infinite combinatorial space, complicating their systematic exploration. Recently, our group developed an invertible and efficient RF representation (Figure 2).[2,32] MOF fragments were extracted from the computation-ready, experimental (CoRE) MOF database[33] and augmented randomly with common functional groups. Furthermore, we added sets of multiconnected metal or organic nodes and sets of known MOF topologies, generating a data set with around 2 × 10^6 MOF structures. Moreover, property simulations were performed for a random subset of about 40 000 MOF structures. The supramolecular variational autoencoder (SmVAE) with a MOF structure encoder-decoder, property prediction model, and framework generation algorithm was constructed with these structures (Figure 2); it can locate high-performing MOFs through property optimization in the latent space. We demonstrated its capabilities for automatic design by proposing top candidates for gas separation adsorbent materials. We believe that the MOFs discovered are highly competitive against the best-performing MOFs/zeolites ever reported. So far, their performance has been validated only with computational methods; experimental verification is under way.
Furthermore, the as-built platform can be applied to various supramolecular systems (e.g., covalent-organic frameworks, coordination polymers, etc.) and applications (e.g., batteries, catalysis, drug delivery).

Figure 2

Automated reticular framework (RF) discovery platform using the supramolecular variational autoencoder (SmVAE). We construct the intermediate representation, RFcode, using unique, decomposed nets as a tuple of edges, vertices, and topologies. We consider the edges as SMILES, while vertices and topologies are categorical variables from known structures. SmVAE is a multicomponent variational autoencoder encoding and decoding each part of the RFcode separately (edge → x~edge, RFcom → ~RFcom). Structures are converted into/back from RFcode using the deconstructor/reconstructor, then transferred into continuous vectors. To organize the latent space based on properties, we add a supervised model to predict properties (~property) based on labeled data. Data from ref (2).

Methodology

High-Throughput Virtual Screening (HTVS)

Virtual screening[34] denotes a selection process of candidate materials. Chemicals, either generated on-the-fly or from databases, are subject to simulations that estimate application-specific properties. Candidates failing computational tests are rejected, with the proviso that predicted performance is likely translatable to experimental performance. Thus, HTVS is a technique that reduces large candidate spaces to a manageable set of promising materials (Figure 3). In our search for new TADF emitters (vide supra),[1] the candidate space was narrowed down by 5 orders of magnitude via HTVS. Importantly, HTVS on large chemical spaces is a form of inverse molecular design (vide infra) because, rather than designing structures directly, the computational tests and the candidate space are designed, which leads to the final hits based on the predicted properties.[35] Moreover, it can provide the basis for both generative and supervised models (vide infra), as they all rely on validated data.
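The funnel logic described above can be sketched in a few lines: a combinatorially generated library passes through stages of increasing cost, and only survivors of a cheap stage reach the next, more expensive one. Everything in this sketch is a toy assumption: the building blocks, the two scoring functions, and the thresholds are hypothetical stand-ins for descriptor models and electronic structure calculations.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    score: Callable[[str], float]  # higher is better
    threshold: float               # candidates scoring below this are rejected


def run_funnel(candidates: List[str], stages: List[Stage]) -> Tuple[List[str], list]:
    """Apply the stages in order and record how many candidates survive each."""
    survivors = list(candidates)
    log = []
    for stage in stages:
        survivors = [c for c in survivors if stage.score(c) >= stage.threshold]
        log.append((stage.name, len(survivors)))
    return survivors, log


# Hypothetical library: all ordered combinations of three building blocks.
BLOCKS = ["A", "B", "C"]
library = ["".join(p) for p in product(BLOCKS, repeat=3)]


def cheap_descriptor(candidate: str) -> float:
    # Toy stand-in for a fast descriptor model.
    return float(candidate.count("A"))


def expensive_simulation(candidate: str) -> float:
    # Toy stand-in for a costly simulation, run only on survivors.
    return candidate.count("A") + 0.5 * candidate.count("B")


stages = [
    Stage("cheap descriptor", cheap_descriptor, 1.0),
    Stage("expensive simulation", expensive_simulation, 2.5),
]
hits, log = run_funnel(library, stages)
```

Ordering the stages from cheapest to most expensive is the design choice that matters here: the costly function is evaluated on 19 of the 27 toy candidates instead of all of them, and the saving grows with the size of the library.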

Figure 3

High-throughput virtual screening starts from a large space of candidates (e.g., generated combinatorically, as illustrated). Using virtual screening, most candidates are eliminated, such that fewer (more expensive and time-consuming) experimental tests can be performed.

Accordingly, HTVS is a powerful accelerator because computer simulation can be significantly less expensive than the respective experiments.[34] The continuing growth in computational power, which will soon reach the exascale, has made virtual screening highly scalable as it is embarrassingly parallel. Although HTVS is almost 20 years old,[36] it only recently started transforming materials science through advances in the accuracy and efficiency of density functional theory (DFT).[37] Besides computational cost, the main appeal of DFT was the possibility to tailor functional parameters to reproduce experiments, which increased its predictive power significantly. For instance, linear response time-dependent DFT (TD-DFT) is accurate and computationally inexpensive for excited state properties. More importantly, it is robust, can be used in a black-box manner, and is readily deployed in simulations of tens of thousands of molecules with minimal failure rates.[14] However, one pernicious failure mode of TD-DFT is the description of excited states with significant double-excitation character, which is, inter alia, important in describing molecules with inverted singlet–triplet gaps,[38,39] such as the INVEST emitters recently described by our group.[40] Nevertheless, as computing power is increasing, more sophisticated ab initio approaches can be used in HTVS, allowing one to tackle ever more complicated problems and new material classes.
Yet, the impact of HTVS has been hampered by the difficulty in scaling the experimental confirmation of candidates,[1] as simulations feasible for high-throughput are still largely qualitative for condensed-phase properties.[41] A loose screen that accounts for computational inaccuracies minimizes false negatives, but the high cost of experimental validation means that almost all candidates must be rejected. The accuracy of computational screening can be maximized by implementing self-correcting filters, such as checking whether simulations showed proper convergence, which catch false positives early on in the workflow. Nevertheless, ultimately, improvements in the experimental throughput are essential, calling for self-driving laboratories and closed-loop experimentation.[42,43]

AI-Powered Inverse Molecular Design

Inverse molecular design[35] starts at the desired properties and explores the chemical space to identify molecules optimizing them. Recently, various ML techniques have been employed to improve inverse molecular design, motivated by advances both on the algorithmic (powerful ML libraries) and the hardware sides (GPU improvements for large neural networks). Importantly, inverse molecular design approaches can be separated roughly into two classes: model-based ML algorithms and evolutionary techniques.

Model-based ML algorithms for inverse design use neural networks to learn patterns in molecular structures from existing data. After training, these models suggest new molecules covering important chemical features from the data set. Several methodologies exist. Herein, we discuss variational autoencoders (VAEs) and generative adversarial networks (GANs) because our group, to the best of our knowledge, was the first to apply these tools in chemistry. VAEs (Figure 4a) are capable of forming continuous (latent) spaces from discrete representations. They are trained to minimize the combined losses of latent space smoothness and input reconstruction, enabling gradient-based optimization in the latent space. For inverse design, the latent space of VAEs is coupled with a property estimation model using supervised learning (vide infra).[44] Consequently, the latent space is arranged based on the property values, allowing for a direct search of desired materials. GANs (Figure 4b) are generative models with joint training of two competing networks, a generator and a discriminator. The generator produces examples from a high dimensional (often Gaussian) space, attempting to fool the discriminator, which tries to distinguish generated samples from reference structures. For molecules, our group proposed a sequential GAN (ORGAN), where the model is trained using reinforcement learning.[45] Desired molecular properties are used as a reward for generating good structures.
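The combined VAE objective mentioned above is usually written as a reconstruction term plus a Kullback-Leibler term that keeps the latent distribution close to a standard normal. The sketch below computes both for a single encoded string, assuming a diagonal Gaussian encoder and a decoder that outputs per-position token probabilities. The function names and the weighting factor `beta` are illustrative assumptions, and no network or training loop is shown.

```python
import math


def reconstruction_nll(token_probs, target_ids):
    """Negative log-likelihood of the true tokens under the decoder output.

    token_probs[i] is the decoder's probability list for position i,
    target_ids[i] is the index of the true token at that position.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(token_probs, target_ids))


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(token_probs, target_ids, mu, log_var, beta=1.0):
    """Reconstruction term plus weighted latent regularization term."""
    return reconstruction_nll(token_probs, target_ids) + beta * kl_to_standard_normal(mu, log_var)


# Two-token toy example with a two-dimensional latent code.
probs = [[0.5, 0.5], [1.0, 0.0]]
targets = [0, 0]
loss = vae_loss(probs, targets, mu=[1.0, 0.0], log_var=[0.0, 0.0])
```

The KL term is what makes the latent space smooth enough for gradient-based search; a property predictor trained jointly on the latent code would add a third term to this sum.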

Figure 4

Inverse molecular design based on desired properties (F), with variational autoencoders (VAEs, a), generative adversarial networks (GANs, b), and genetic algorithms (GAs, c). Adapted with permission from ref (44). Copyright 2018 American Chemical Society.

Notably, both VAEs and GANs are trained in a supervised way. Hence, they rely on existing data and mimic their distribution. Thus, they are limited in the exploration of the chemical space as compared to evolutionary techniques such as genetic algorithms (GAs, cf. Figure 4c). As their name implies, GAs are inspired by natural evolution. An initial population seeds the algorithm, each member being evaluated. The top-performing members proceed to the next iteration and the worst members are removed or replaced by better offspring. For inverse molecular design, the fitness function corresponds to the determination of desired molecular properties. In contrast to deep learning-based models, GAs are not biased by user-defined data sets. Therefore, they are superior in unbiased explorations.[3] Recently, we have shown that GAs augmented with neural networks to estimate the similarity of a molecule with a given data set can explore specific structural classes without the large data requirements of GANs and VAEs. Additionally, neural network-based learning was used to detect and avoid local minima trapping the GA to amplify exploration by avoiding convergence.[3] Notably, this shows that ML-based inverse design techniques can be effectively combined with evolutionary algorithms.

Importantly, in all these approaches, molecular representation plays a crucial role. Molecular graphs are used for computational efficiency, as they avoid conformations. Simplified Molecular Input Line Entry System (SMILES)[46] strings are commonly used as a flat encoding of molecular graphs. However, they have a complex structure, making a large fraction of molecules decoded from arbitrary SMILES invalid.
This problem was solved recently by our group in a fundamental way by replacing SMILES with SELFIES (Self-Referencing Embedded Strings),[47] which is available on GitHub.[48] SELFIES is a 100% valid molecular string representation suitable as input for any inverse-design algorithm that outperformed alternative approaches in many benchmarks, such as validity and diversity of generated molecules, molecular density in the latent space of VAEs, or molecular optimization tasks with GAs.[3]
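The evolutionary loop described in this section (seed a population, evaluate fitness, keep the top performers, refill with offspring) can be sketched on token strings as follows. This is a toy under stated assumptions: the four-token alphabet stands in for a molecular string alphabet in which every token sequence decodes to a valid structure, the fitness function is an arbitrary placeholder for a property calculation, and the neural-network augmentation of ref 3 is not included.

```python
import random

# Hypothetical token set standing in for a molecular string alphabet.
ALPHABET = ["C", "N", "O", "F"]


def fitness(tokens):
    # Toy property: reward nitrogen tokens, penalize fluorine tokens.
    return tokens.count("N") - 0.5 * tokens.count("F")


def mutate(tokens, rng):
    # Replace one randomly chosen position with a random token.
    child = list(tokens)
    child[rng.randrange(len(child))] = rng.choice(ALPHABET)
    return child


def crossover(a, b, rng):
    # Single-point crossover between two parents of equal length.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def run_ga(pop_size=20, length=10, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(ALPHABET) for _ in range(length)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # top half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    population.sort(key=fitness, reverse=True)
    best_history.append(fitness(population[0]))
    return population[0], best_history


best, history = run_ga()
```

Because the top half of each generation is carried over unchanged, the best fitness can never decrease. This is also why such a loop can stall at a local optimum, which is the situation the neural-network-based detection mentioned above is meant to address.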

Bayesian Optimization

Several tasks across chemistry can be framed as optimization problems, where controllable parameters optimizing a desired objective are sought. For materials, such optimizations are challenging, as they are typically high-dimensional, nonconvex, and subject to noise, and the objectives are expensive to evaluate. Suitable optimization strategies ought to be sample-efficient, global, and noise-tolerant. That is, they need to identify optimal parameter choices with as few measurements as possible, be able to escape local minima, and mitigate the detrimental effect of noise. A plethora of experiment planning strategies for optimization are currently available,[49] from traditional design of experiment to evolutionary and heuristic approaches. Among these, Bayesian optimization[50] (BO) has emerged as the strategy that best meets these requirements. BO is an experiment planning algorithm that, in contrast to most other approaches, uses an ML model to learn from previous observations before suggesting the next iteration (Figure 5a).[50] In its most widely adopted form, BO employs techniques such as Gaussian processes to build a surrogate model that captures the features of the underlying objective function. Based on this surrogate, an acquisition function is defined, which determines the strategy used to propose new experiments (Figure 5b). Just as BO formulations using different ML models exist, various acquisition functions have been developed. Due to the use of an ML model, BO is sample-efficient. It is also noise-tolerant, as these models explicitly account for it. Finally, BO is a global approach that balances the exploitation of the best local optima identified with the exploration of unprobed areas of parameter space.
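A minimal, self-contained sketch of the loop described above is given here for a one-dimensional minimization problem. It uses a Gaussian process with a squared-exponential kernel as the surrogate and a lower confidence bound as the acquisition function. The kernel length scale, the exploration weight `kappa`, the candidate grid, and the toy objective are all assumptions chosen for illustration; this is not the algorithm of Phoenics or Gryffin.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with an assumed length scale."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt(objective, n_iter=12, kappa=2.0):
    grid = [i / 100 for i in range(101)]  # candidate parameter values
    xs = [0.0, 0.5, 1.0]                  # initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        def lower_confidence_bound(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean - kappa * std     # low mean or high uncertainty
        candidates = [x for x in grid if x not in xs]
        x_next = min(candidates, key=lower_confidence_bound)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def toy_objective(x):
    # Stand-in for an expensive experiment; minimum at x = 0.3.
    return (x - 0.3) ** 2


best_x, best_y = bayes_opt(toy_objective)
```

The single parameter `kappa` controls the balance discussed above: larger values favor unprobed regions (exploration), while smaller values favor the neighborhood of the best observations (exploitation).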

Figure 5

(a) General pseudocode for Bayesian optimization. (b) Visualization of Bayesian optimization of an objective function (red curve) using Gaussian processes. (c) Examples of continuous-valued parameters compatible with Phoenics, along with a sample surrogate model and acquisition functions generated by the algorithm. Adapted with permission from ref (4). Copyright 2018 American Chemical Society. (d) Depiction of the representation of a categorical variable in Gryffin with three options (e.g., three ligands) on a simplex.[51] (e) Example of a multiobjective optimization problem for a chemical reaction, along with the construction of Chimera (bottom panel) from three 1-dimensional objective functions. Reproduced with permission from ref (52). Copyright 2018 Royal Society of Chemistry.

Typical BO approaches are inherently sequential and require heavy computations for each iteration. Therefore, BO can be unduly expensive when used in conjunction with high-throughput evaluations. Thus, our group has developed Phoenics (Figure 5c), a linear-scaling BO approach that supports parallel experiments.[4] Phoenics employs Bayesian neural networks (BNNs) to build a kernel density estimate of the objective function, and its acquisition function allows for selection of batches of evaluations to be run in parallel. Importantly, Phoenics is suitable for the optimization of continuous parameters, such as temperature and concentration. To also optimize categorical parameters, such as the choice of solvent, we developed Gryffin (Figure 5d), which uses categorical kernel densities that can be relaxed to continuous ones.[51] In addition, Gryffin allows for expert knowledge, in the form of descriptors for each categorical choice, to be provided to improve the optimization efficiency. Often, multiple competing objectives are present in materials science.
Chimera (Figure 5e) is a general-purpose approach to multiobjective optimization.[52] It allows defining a hierarchy of objective preferences, which are combined into a single function to be optimized with any algorithm of choice. Importantly, all the aforementioned algorithms can be combined with automated laboratories to enable autonomous experimentation.[42] These self-driving platforms are able to execute closed-loop workflows for the self-optimization of materials and processes. However, this requires robust software connections between automated hardware and experiment planning methods. ChemOS is a flexible, modular, open source, and portable Python package that provides this interface between experiment planning and automated experiments.[53,54] Accordingly, in our laboratory, we have deployed ChemOS, together with Phoenics, Gryffin, and Chimera, for the autonomous optimization of manufacturing processes of thin-film materials,[55] multicomponent polymer OPV blends,[56] and reaction conditions of stereoselective Suzuki coupling.[57]
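The idea of a hierarchy of objective preferences can be illustrated with a deliberately simplified selection rule. The sketch below is not Chimera's scalarizing function; it is a hard-threshold toy that assumes all objectives are minimized, are listed from most to least important, and each come with an absolute tolerance. The function name and the tie-breaking choice on the last objective are assumptions of this sketch.

```python
def hierarchical_choice(observations, tolerances):
    """Return the index of the preferred observation.

    observations: list of tuples of objective values (to be minimized),
                  ordered from most to least important objective.
    tolerances:   one acceptable upper bound per objective.
    """
    pool = list(range(len(observations)))
    for k, tol in enumerate(tolerances):
        satisfied = [i for i in pool if observations[i][k] <= tol]
        if not satisfied:
            # Nothing meets this tolerance yet: improve this objective first.
            return min(pool, key=lambda i: observations[i][k])
        pool = satisfied  # only these compete on the next objective
    # All tolerances met: break ties on the least important objective.
    last = len(tolerances) - 1
    return min(pool, key=lambda i: observations[i][last])


# Hypothetical measurements of (objective 1, objective 2) for four experiments.
measurements = [(0.9, 0.1), (0.4, 0.5), (0.3, 0.8), (0.45, 0.2)]
preferred = hierarchical_choice(measurements, tolerances=(0.5, 0.3))
```

With these toy numbers the fourth experiment is preferred: it is the only one that satisfies the tolerance on the first objective and also the tolerance on the second, even though the first experiment has the lowest value of the second objective overall.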

Supervised Learning

The costs associated with property measurement, from both experiments and simulations, are a major obstacle to the widespread expansion of HTVS, optimization, and inverse design. All of these techniques require some form of data acquisition, i.e., simulations, measurements, or data mining. However, adapting experimental design to suit the needs of automated protocols is challenging, despite self-driving approaches likely being overall cost-effective. The promise of accurate and practically free inference of new results from existing data via supervised learning is a major driver of the ongoing ML revolution in the physical sciences.[58] Supervised learning requires a data set of features and labels.[59] For molecular property prediction, this data set contains molecules in a specific representation (features) and their corresponding properties (labels). First, the data set is split into three sets: training, validation, and holdout. The model is trained stepwise on the training set, usually by gradient descent or related algorithms. In general, hyperparameters, i.e., choice of features, training set, and model architecture, influence predictive performance. These hyperparameters are optimized by maximizing prediction accuracy on the validation set. Eventually, model performance is evaluated via prediction accuracy for the holdout set, and the final model can be used to predict properties for unlabeled molecules. The entire workflow is illustrated in Figure 6. Our group developed several model architectures for supervised learning of molecular properties, most notably graph convolutional neural networks.[60,61]
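The three-way split and its role in hyperparameter selection can be sketched with a deliberately small model. The example below fits a one-parameter ridge regression in closed form on synthetic data, chooses the regularization strength on the validation set, and reports the error on the holdout set only once at the end. The synthetic data, the candidate regularization values, and the 60/20/20 split are assumptions for illustration; a real application would use molecular features and a far richer model.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n=100, seed=0):
    """Synthetic labeled data: y = 2x plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def train_validate_holdout(xs, ys, lambdas=(0.0, 1.0, 10.0, 100.0)):
    n = len(xs)
    n_train, n_val = (6 * n) // 10, (2 * n) // 10
    train = (xs[:n_train], ys[:n_train])
    val = (xs[n_train:n_train + n_val], ys[n_train:n_train + n_val])
    hold = (xs[n_train + n_val:], ys[n_train + n_val:])

    # Hyperparameter selection uses the validation set only.
    best_lam = min(lambdas, key=lambda lam: mse(fit_ridge_1d(*train, lam), *val))
    w = fit_ridge_1d(*train, best_lam)

    # The holdout set is touched once, for the final error estimate.
    sizes = (len(train[0]), len(val[0]), len(hold[0]))
    return best_lam, w, mse(w, *hold), sizes


xs, ys = make_data()
best_lam, weight, holdout_mse, sizes = train_validate_holdout(xs, ys)
```

Keeping the holdout set out of every modeling decision is what allows its error to serve as an estimate of performance on unlabeled molecules; reusing it for model selection would make that estimate optimistic.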

Figure 6

Workflow for supervised learning of molecular properties. A known (labeled) data set is used to optimize a model, which is subsequently used to estimate molecular properties for an unknown (unlabeled) data set.

Importantly, supervised learning has been used successfully for materials discovery. For example, our group used the CEP data set for property prediction.[62] After training on more than 200 000 molecules, a neural network predicted the result of DFT calculations consistently at a fraction of the computational expense. Additionally, our group applied this approach to reduce the number of simulations in HTVS significantly, with training on a set of similar size.[1] Moreover, our group also used Gaussian process regression to calibrate for systematic errors in DFT.[16] Crucially, in these studies, ML algorithms, representations, acquisition of training data, and validation procedures for models were tightly integrated with an understanding of the problem space, as opposed to sole reliance on existing data from various sources. We believe these considerations are key when it comes to the practical application of ML in chemistry. Moreover, fruitful applications of supervised learning in materials science start from well-defined scientific goals. In contrast, the excitement brought about by ML has generated many studies that focus on learning performance rather than scientific objectives. Generally, this is based on the (debatable and often unsupported) idea that performance metrics on one data set are transferable to other data sets or related problems. However, ML algorithms are highly parametrized and thus can readily overfit.[63] Indeed, the model choice can itself become a form of overfitting, especially when done on performance considerations alone.[64] Moreover, training data bias can contaminate predictions,[65] but accounting for these biases appropriately is problem-specific.
Furthermore, many studies are focused on error estimates obtained from statistical measures such as cross-validation. Although validation error can be a useful guide to the true prediction error on new data, it is not a replacement for it[66] and is often too optimistic.[67] In many ways, these issues arise when focus on the scientific goals is lost, as ultimately the best test of supervised learning is whether it solves problems.
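The optimism of validation error under model selection can be made concrete with a toy experiment. The sketch below is an illustration only, not taken from the cited studies: the labels are coin flips, so no model can truly exceed 50% accuracy, and each "model" is a hypothetical stand-in that guesses at random. Choosing the best of many such models by validation accuracy nevertheless yields an apparently good validation score, which does not carry over to a holdout set that played no part in the selection.

```python
import random

rng = random.Random(0)

N_VAL, N_HOLDOUT, N_MODELS = 50, 2000, 500

# Labels are pure coin flips: there is nothing to learn.
val_labels = [rng.randint(0, 1) for _ in range(N_VAL)]
holdout_labels = [rng.randint(0, 1) for _ in range(N_HOLDOUT)]

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

best_val_acc, best_holdout_acc = -1.0, None
for _ in range(N_MODELS):
    # Each candidate "model" simply guesses at random on both sets.
    val_preds = [rng.randint(0, 1) for _ in range(N_VAL)]
    holdout_preds = [rng.randint(0, 1) for _ in range(N_HOLDOUT)]
    val_acc = accuracy(val_preds, val_labels)
    # Model selection: keep the candidate with the best validation accuracy.
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_holdout_acc = accuracy(holdout_preds, holdout_labels)

print(f"validation accuracy of selected model: {best_val_acc:.2f}")
print(f"holdout accuracy of selected model:    {best_holdout_acc:.2f}")
```

The gap between the two printed numbers is produced entirely by the selection step, which is why an error estimate that was used to choose the model cannot also serve as the final measure of its performance.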

Conclusion and Outlook

In this Account, we have reviewed data-driven approaches our group has employed for the design of materials, especially for clean energy applications, in the past decade. One of the first large-scale campaigns our group embarked on was the CEP, where we implemented supervised learning together with HTVS using quantum chemistry simulations to investigate 10^7 potential donor molecules for organic solar cells and devised design principles by statistical analysis of structure–function relationships.[12] In the subsequent years, we refined these ML strategies and expanded our efforts toward other important materials such as OLEDs, OFRBs, multivalent CIBs, and RFs. In all these projects, data-driven workflows were key to speeding up both the discovery and the design of new materials.

However, we believe that the full potential of data-driven strategies is yet to be unleashed. For instance, many properties are currently not investigated in HTVS because of their prohibitive computational cost. One such property is molecular stability with respect to common decomposition pathways. The associated problem is the huge dimensionality of the space of potential reactions that molecules can undergo, which greatly exceeds the chemical compound space in complexity. Recently, our group developed a method for the automatic discovery of chemical reactions based on the selection of reactive internal coordinates such as weak chemical bonds.[68] We believe this approach, together with empirical rules or heuristics for selecting reactive internal coordinates, could be used for HTVS of reactivity and stability of materials, and research in that direction is ongoing. Other properties that are too costly for HTVS include the influence of explicit solvation on spectroscopic properties and the direct simulation of amorphous solid-state structures and properties. The main challenge therein is the large number of particles and degrees of freedom in the model systems and the associated multitude of interactions.
Furthermore, some of the methodologies we developed have only been tested on benchmark problems and are yet to be employed in real applications. In particular, the genetic algorithm augmented with neural networks using SELFIES as molecular representation,[47] which our group proposed recently, has outperformed most alternative generative models in benchmarks. However, it has yet to be applied to the design of functional materials, and we are actively working on that.[3]

Finally, one of the most critical challenges of ML is model interpretability. Typically, supervised learning approaches are employed in a black-box fashion without gaining insight into what the model actually learned. However, our group has shown recently that regression methods such as gradient boosting, when trained on molecular graph features, can be used to reveal important chemical moieties influencing the properties.[69,70] The trained model can be interpreted by human experts, and rationalizing the feature importance can lead to new scientific understanding. We believe that similar approaches have the potential to change the way science is carried out in the near future.

However, the bottleneck of materials design campaigns is experimental synthesis and characterization, usually by a large margin.[71] Any material, no matter how good its (predicted) performance, needs to be synthesized for it to be used in real life. For clean energy applications in particular, material syntheses need to be performed on a huge scale, requiring reliable, safe, and green chemical processes. Accordingly, the continuing growth in computing power, which provides unprecedented prediction capabilities, needs to be paralleled by increased experimental throughput.
Accelerating materials design ultimately requires close integration of computer simulation, ML, and experimentation in self-driving platforms, which our group termed Materials Acceleration Platforms (MAPs).[43] One essential feature of MAPs is a closed-loop materials discovery workflow incorporating experimentation, computation, and human intuition. Online characterization techniques in conjunction with automated robotic synthesis[72−74] are central enabling technologies in these platforms. Making and measuring molecules on demand in a feedback loop with self-correcting computational screening and ML is key to finding true “needle-in-a-haystack” materials. Currently, our group is implementing such a MAP for the realization of innovative materials, making use of robust cross-coupling chemistry, parallel robotic synthesis, and in-line characterization of spectroscopic properties coupled with computer simulation and ML. Details of this implementation will be described in an upcoming Account that our group is currently working on. Accordingly, the data-driven methods described above are a stepping stone toward accelerated materials design. However, to realize their true potential, they need to percolate into experimental systems, and we are looking forward to witnessing applications of these methods in closed-loop experimental material design campaigns in the near future.
