Robert Pollice, Gabriel Dos Passos Gomes, Matteo Aldeghi, Riley J Hickman, Mario Krenn, Cyrille Lavigne, Michael Lindner-D'Addario, AkshatKumar Nigam, Cher Tian Ser, Zhenpeng Yao, Alán Aspuru-Guzik.
Abstract
The ongoing revolution of the natural sciences by the advent of machine learning and artificial intelligence has sparked significant interest in the material science community in recent years. The intrinsically high dimensionality of the space of realizable materials makes traditional approaches ineffective for large-scale explorations. Modern data science and machine learning tools developed for increasingly complicated problems are an attractive alternative. An imminent climate catastrophe calls for a clean energy transformation that overhauls current technologies within the few years still available for action. Tackling this crisis requires the development of new materials at an unprecedented pace and scale. For example, organic photovoltaics have the potential to replace existing silicon-based materials to a large extent and open up new fields of application. In recent years, organic light-emitting diodes have emerged as state-of-the-art technology for digital screens and portable devices and are enabling new applications with flexible displays. Reticular frameworks allow the atom-precise synthesis of nanomaterials and promise to revolutionize the field through their potential to realize multifunctional nanoparticles with applications ranging from gas storage, gas separation, and electrochemical energy storage to nanomedicine. Over the past decade, significant advances in all these fields have been facilitated by the comprehensive application of simulation and machine learning for property prediction, property optimization, and chemical space exploration, enabled by considerable advances in computing power and algorithmic efficiency.

In this Account, we review the most recent contributions of our group in this thriving field of machine learning for material science. We start with a summary of the most important material classes our group has been involved in, focusing on small molecules as organic electronic materials and crystalline materials.
Specifically, we highlight the data-driven approaches we employed to speed up discovery and derive material design strategies. Subsequently, our focus lies on the data-driven methodologies our group has developed and employed, elaborating on high-throughput virtual screening, inverse molecular design, Bayesian optimization, and supervised learning. We discuss the general ideas, their working principles, and their use cases with examples of successful implementations in data-driven material discovery and design efforts. Furthermore, we elaborate on potential pitfalls and remaining challenges of these methods. Finally, we provide a brief outlook for the field as we foresee increasing adaptation and implementation of large-scale data-driven approaches in material discovery and design campaigns.
Year: 2021 PMID: 33528245 PMCID: PMC7893702 DOI: 10.1021/acs.accounts.0c00785
Source DB: PubMed Journal: Acc Chem Res ISSN: 0001-4842 Impact factor: 22.384
Figure 1. Inverse design workflow for thermally activated delayed fluorescence organic emitters from selecting fragments to device integration and testing.
Figure 2. Automated reticular framework (RF) discovery platform using the supramolecular variational autoencoder (SmVAE). We construct the intermediate representation, RFcode, using unique, decomposed nets as a tuple of edges, vertices, and topologies. We consider the edges as SMILES, while vertices and topologies are categorical variables from known structures. SmVAE is a multicomponent variational autoencoder encoding and decoding each part of the RFcode separately (edge → x~edge, RFcom → ~RFcom). Structures are converted into/back from RFcode using the deconstructor/reconstructor, then transferred into continuous vectors (). To organize the latent space based on properties, we add a supervised model to predict properties (~property) based on labeled data (). Data from ref (2).
Figure 3. High-throughput virtual screening starts from a large space of candidates (e.g., generated combinatorially, as illustrated). Virtual screening eliminates most candidates, so that the more expensive and time-consuming experimental tests need to be performed on only a few.
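The funnel described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the group's pipeline: the fragment labels, `cheap_score`, and `expensive_score` are hypothetical stand-ins for real building blocks, a fast approximate estimate, and a slower, more accurate calculation.

```python
from itertools import product

# Hypothetical fragment libraries; a real campaign would use molecular building blocks.
DONORS = ["D1", "D2", "D3"]
BRIDGES = ["B1", "B2"]
ACCEPTORS = ["A1", "A2", "A3"]


def cheap_score(candidate):
    """Stand-in for a fast, approximate property estimate (hypothetical)."""
    return sum(int(fragment[1]) for fragment in candidate)


def expensive_score(candidate):
    """Stand-in for a slower, more accurate calculation (hypothetical)."""
    donor, bridge, acceptor = candidate
    return int(donor[1]) * int(acceptor[1]) - int(bridge[1])


def screen(candidates, score, keep):
    """Rank candidates by `score` (higher is better) and keep the top `keep`."""
    return sorted(candidates, key=score, reverse=True)[:keep]


# Combinatorial enumeration: every donor-bridge-acceptor combination.
library = list(product(DONORS, BRIDGES, ACCEPTORS))

# Funnel: the cheap filter runs on everything, the costly one only on survivors.
stage1 = screen(library, cheap_score, keep=6)
stage2 = screen(stage1, expensive_score, keep=2)

print(len(library), len(stage1), len(stage2))
print(stage2)
```

The point of the staged design is that the expensive scorer is called on 6 candidates rather than 18; in a real screen the ratio is typically far larger.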
Figure 4. Inverse molecular design based on desired properties (F), with variational autoencoders (VAEs, a), generative adversarial networks (GANs, b), and genetic algorithms (GAs, c). Adapted with permission from ref (44). Copyright 2018 American Chemical Society.
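Of the three approaches in this caption, the genetic algorithm is the simplest to sketch. The example below is a toy under stated assumptions: candidates are plain strings over a hypothetical alphabet, and `fitness` (agreement with a hypothetical `TARGET` string) stands in for the desired property F. It does not model real chemistry or any specific published method.

```python
import random

ALPHABET = "CNOF"      # hypothetical building-block symbols, not real chemistry
TARGET = "CCNOCCFN"    # hypothetical stand-in for the desired property F


def fitness(candidate):
    """Toy objective: number of positions that match the hypothetical target."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def mutate(candidate, rng, rate=0.1):
    """Replace each symbol by a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else symbol
        for symbol in candidate
    )


def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(generations=40, pop_size=30, elite=2, seed=0):
    """Run the GA and return the best candidate plus the best fitness per generation."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = population[:elite] + children  # elitism keeps the best so far
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(best, history[-1])
```

Because the top `elite` candidates are copied unchanged into each new generation, the best fitness in `history` can never decrease.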
Figure 5. (a) General pseudocode for Bayesian optimization. (b) Visualization of Bayesian optimization of an objective function (red curve) using Gaussian processes. (c) Examples of continuous-valued parameters compatible with Phoenics, along with a sample surrogate model and acquisition functions generated by the algorithm. Adapted with permission from ref (4). Copyright 2018 American Chemical Society. (d) Depiction of the representation of a categorical variable in Gryffin with three options (e.g., three ligands) on a simplex [51]. (e) Example of a multiobjective optimization problem for a chemical reaction, along with the construction of Chimera (bottom panel) from three one-dimensional objective functions. Reproduced with permission from ref (52). Copyright 2018 Royal Society of Chemistry.
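The general loop referred to in panels (a) and (b) can be sketched as follows. This is a generic illustration under stated assumptions, not Phoenics, Gryffin, or Chimera: the surrogate is a zero-mean Gaussian process with a squared-exponential kernel, the acquisition function is an upper confidence bound, the search space is a one-dimensional grid, and `objective` is a hypothetical function standing in for an experiment or simulation.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def forward_solve(low, rhs):
    """Solve L x = rhs by forward substitution."""
    x = []
    for i, row in enumerate(low):
        x.append((rhs[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def backward_solve(low, rhs):
    """Solve L^T x = rhs by back substitution."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rhs[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def fit_gp(xs, ys, noise=1e-4):
    """Factorize the kernel matrix and precompute alpha = K^-1 y."""
    kernel = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
              for i, a in enumerate(xs)]
    low = cholesky(kernel)
    return low, backward_solve(low, forward_solve(low, ys))


def predict_gp(xs, low, alpha, x_new):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_solve(low, k_star)
    variance = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(variance)


def objective(x):
    """Hypothetical objective to maximize; in practice an experiment or simulation."""
    return math.exp(-((x - 0.3) / 0.15) ** 2)


def bayesian_optimization(n_steps=8, kappa=2.0):
    """Alternate between fitting the surrogate and evaluating the best-scoring point."""
    grid = [i / 50 for i in range(51)]
    xs = [0.0, 0.5, 1.0]                      # initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_steps):
        low, alpha = fit_gp(xs, ys)
        candidates = [x for x in grid if x not in xs]
        scores = []
        for x in candidates:
            mean, std = predict_gp(xs, low, alpha, x)
            scores.append(mean + kappa * std)  # upper confidence bound
        x_next = candidates[scores.index(max(scores))]
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best], xs


best_x, best_y, evaluated = bayesian_optimization()
print(best_x, best_y, len(evaluated))
```

The parameter `kappa` sets the balance between exploiting regions with a high predicted mean and exploring regions where the surrogate is uncertain; other acquisition functions, such as expected improvement, slot into the same loop.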
Figure 6. Workflow for supervised learning of molecular properties. A known (labeled) data set is used to optimize a model, which is subsequently used to estimate molecular properties for an unknown (unlabeled) data set.
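The workflow in this caption can be illustrated with the simplest possible model. The sketch below assumes a single hypothetical numeric descriptor per molecule and fits an ordinary least-squares line; the data values are invented for illustration, and a real application would use richer molecular representations and models.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def predict(model, xs):
    """Estimate the property for each descriptor value."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between labels and predictions."""
    return mean(abs(a - b) for a, b in zip(y_true, y_pred))


# Hypothetical labeled data set: (descriptor, measured property) pairs.
labeled = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1), (6.0, 12.9)]

# Hold out part of the labeled data to check the model before trusting it.
# A fixed split keeps the example deterministic; real workflows usually
# shuffle the data or use cross-validation.
train, held_out = labeled[:4], labeled[4:]
model = fit_linear(*zip(*train))

held_x, held_y = zip(*held_out)
mae = mean_absolute_error(held_y, predict(model, held_x))

# Apply the validated model to the unlabeled data set (hypothetical descriptors).
unlabeled = [2.5, 7.0]
estimates = predict(model, unlabeled)

print(model, mae, estimates)
```

The held-out error is the only evidence of how far the estimates for the unlabeled set can be trusted, which is why the split is made before fitting.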