| Literature DB >> 35936396 |
Dhruv Menon, Raghavan Ranganathan.
Abstract
Despite its potential to transform society, materials research suffers from a major drawback: its long research timeline. Recently, machine-learning techniques have emerged as a viable solution to this problem and have shown accuracies comparable to other computational techniques like density functional theory (DFT) at a fraction of the computational time. One particular class of machine-learning models, known as "generative models", is of particular interest owing to its ability to approximate high-dimensional probability distribution functions, which in turn can be used to generate novel data such as molecular structures by sampling these approximated probability distribution functions. This review article aims to provide an in-depth understanding of the underlying mathematical principles of popular generative models such as recurrent neural networks, variational autoencoders, and generative adversarial networks and discuss their state-of-the-art applications in the domains of biomaterials and organic drug-like materials, energy materials, and structural materials. Here, we discuss a broad range of applications of these models, spanning from the discovery of drugs that treat cancer to finding the first room-temperature superconductor, and from the discovery and optimization of battery and photovoltaic materials to the optimization of high-entropy alloys. We conclude by presenting a brief outlook of the major challenges that lie ahead for the mainstream usage of these models for materials research.
Year: 2022 PMID: 35936396 PMCID: PMC9352221 DOI: 10.1021/acsomega.2c03264
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. (a) Selected examples of timelines of the development of modern-day advanced materials versus the expectations of investors and typical development capacities. Reprinted with permission from ref (3). Copyright 2018 Elsevier. (b) The conventional seven-stage materials discovery, development, and optimization process, with each stage having complexities associated with it, making the overall process very time-consuming. Inspired by ref (5).
Figure 2. (a) Simple neural network with an input layer, a single hidden layer, and an output layer. The node in the hidden layer has a bias (b1) associated with it, while the mapping from the input layer to the hidden layer has a set of weights (W1, W2, and W3) associated with it. Through the training process, these weights and biases are updated in order to optimize (minimize) the loss function. The function g (often referred to as the activation function) is usually a nonlinear function. (b) A representative illustration of an RNN, where the current output depends not only on the current input but also on the previous input, making it useful for the generation of sequential data. (c) A representative illustration of a VAE composed of encoder and decoder networks. The encoder converts a discrete data distribution to a continuous latent space, which is converted back to a discrete representation by the decoder. By moving in the latent space, new samples can be generated. (d) A representative illustration of a GAN composed of generator and discriminator networks. Using a probabilistic approach, the generator attempts to create a synthetic data distribution that is very similar to a real distribution, while the discriminator attempts to differentiate between real and synthetic data.
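The two computations sketched in Figure 2a and 2b can be written in a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the input values, weights W1–W3, bias b1, and hidden size are arbitrary stand-ins, and sigmoid/tanh are assumed as the nonlinear activation g.

```python
import numpy as np

def sigmoid(x):
    # A common choice for the nonlinear activation g in Figure 2a.
    return 1.0 / (1.0 + np.exp(-x))

# --- (a) Single hidden node: three inputs, weights W1-W3, bias b1 ---
x = np.array([0.5, -1.0, 2.0])       # input layer (three nodes)
W = np.array([0.1, 0.4, -0.3])       # weights W1, W2, W3
b1 = 0.2                             # bias of the hidden node
hidden = sigmoid(np.dot(W, x) + b1)  # hidden activation g(W . x + b1)

# --- (b) One RNN step: output depends on the current input AND the
#         previous hidden state, which carries the sequence history ---
W_xh = np.array([[0.5], [0.3]])      # input -> hidden weights (hidden size 2)
W_hh = np.eye(2) * 0.9               # hidden -> hidden recurrence weights
h_prev = np.zeros(2)                 # previous hidden state (zeros at t = 0)
x_t = np.array([1.0])                # current input token/value
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
```

Training would repeatedly adjust W, b1, W_xh, and W_hh to minimize a loss function, as the caption describes; that loop is omitted here.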
Figure 3. Schematic of a workflow for developing a machine-learning model from a materials research perspective. Beginning with the identification of the objective function, the handling and curation of data and molecular featurization have emerged as two of the biggest challenges to the success of these models. Each step is critical for the model to work smoothly.
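To make the featurization step of this workflow concrete, here is a minimal sketch of one common scheme: one-hot encoding a SMILES string over a fixed character vocabulary before feeding it to a sequence model. The tiny vocabulary and the helper name `one_hot_smiles` are illustrative assumptions, not part of the reviewed workflow.

```python
# Toy character vocabulary; real models use the full set of SMILES tokens.
VOCAB = ["C", "N", "O", "(", ")", "=", "1"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles):
    """Return one one-hot vector per character of the SMILES string."""
    encoded = []
    for ch in smiles:
        vec = [0] * len(VOCAB)
        vec[CHAR_TO_IDX[ch]] = 1  # KeyError for out-of-vocabulary characters
        encoded.append(vec)
    return encoded

features = one_hot_smiles("C=O")  # formaldehyde written in SMILES
```

In practice this encoding step sits between data curation and model training, and its design (tokenization, vocabulary coverage, padding) strongly affects what the model can learn.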
Nonexhaustive Set of Publicly Available Datasets That Can Be Used for a Wide Range of Machine-Learning Algorithms in Materials Science
| data set | features | references |
|---|---|---|
| PubChem | Information about 109,907,032 unique chemical structures, 271,129,167 chemical entities, and 1,366,265 bioassays | ( |
| ZINC | Over 230,000,000 commercially available compounds for the purpose of virtual screening in drug discovery | ( |
| ChEMBL | Over 2,086,898 bioactive molecules and 17,726,334 bioactivities for effective drug discovery | ( |
| ChemDB | Over 5,000,000 commercially available small molecules intended for drug discovery | ( |
| ChemSpider | Chemical structure data for over 100,000,000 structures from 276 data sources | ( |
| DrugBank | Over 200 data fields each for 14,853 small-molecule drugs (including 2687 approved small-molecule drugs) | ( |
| The Materials Project | Information about 131,613 inorganic compounds, 49,705 molecules, 530,243 nanoporous materials, and 76,194 band structures | ( |
| GDB9 | 17 properties each of 134,000 neutral molecules with up to nine heavy atoms (C, O, N, and F), not counting hydrogen | ( |
| Pauling File | Information about 310,000 crystal structures for 140,000 different phases, 44,000 phase diagrams, and 120,000 physical properties | ( |
| AFLOW | Information about 3,513,989 material compounds with over 695,769,822 computed properties | ( |
| HTEM DB | 37,093 compositional, 47,213 structural, 26,577 optical, and 12,849 electrical properties of thin films | ( |
| OQMD | DFT-calculated structural and thermodynamic properties of 815,654 materials | ( |
| SuperCon | Superconducting properties of 33,284 oxide and metallic samples and 564 organic samples from the literature | ( |
Figure 4. (a) Representative illustration of an RNN model for the generation of drug-like molecules. Reprinted from ref (89) under the Creative Commons license (CC BY-NC-ND 4.0). (b) Representative illustration of a VAE model for the generation of drug-like molecules. Reprinted from ref (29) under the Standard ACS AuthorChoice/Editors’ Choice Usage Agreement.
Figure 5. (a) Two-dimensional PCA plots of the variational autoencoder latent space, where the two axes represent the selected properties. The legend on the right of each plot indicates the value of that selected property. The plots in the leftmost column are for the VAE, the central column for the joint VAE-property prediction, and the rightmost column for a sample of the latent space. As discussed in Section , in the case of the joint VAE-property prediction, depending on the value of the property, the molecules tend to segregate into localized regions. Reprinted from ref (29) under the Standard ACS AuthorChoice/Editors’ Choice Usage Agreement. (b) The top row represents the starting molecules, while the bottom row represents the generated molecules, along with their similarity index. Reprinted from ref (108) under the Creative Commons Attribution 4.0 International License.
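The PCA projection behind plots like Figure 5a reduces high-dimensional VAE latent vectors to two coordinates for scatter plotting. A minimal NumPy sketch follows; the random latent vectors and their dimensions are stand-ins for what a trained encoder would actually produce.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 32))  # stand-in: 500 latent vectors of dim 32

# PCA via SVD: center the data, then project onto the top two
# right-singular vectors (the two principal components).
centered = latent - latent.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T      # (500, 2) coordinates for the scatter plot
```

Coloring each projected point by a predicted property value then reveals whether molecules with similar properties cluster in the latent space, as the joint VAE-property model in the figure exhibits.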
Figure 6. (a) Representative illustration of a workflow for the generation of potential donor–acceptor oligomers to serve as photovoltaic materials. Here, the specialized RNN was fine-tuned for the generation of the said oligomers. Reprinted from ref (116) under the Creative Commons license (CC BY-NC 3.0). (b) A representative illustration of the architecture of the conditional GAN used to produce realistic data samples from existing HEA data sets by generating samples from the latent space. Reprinted from ref (128) under the Creative Commons license (CC BY-NC-ND 4.0).
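A conditional GAN like the one in Figure 6b is commonly conditioned by concatenating a condition vector with the latent noise before it enters the generator. The sketch below shows only that input-assembly step with a one-layer stand-in for the generator; all dimensions and values are hypothetical, and the real model in ref (128) is a deeper, trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_dim, cond_dim, out_dim = 8, 3, 5   # hypothetical sizes

z = rng.normal(size=noise_dim)           # latent noise sample
condition = np.array([0.2, 0.5, 0.3])    # e.g., target composition fractions
gen_input = np.concatenate([z, condition])

# One-layer stand-in for the generator (real generators are deeper).
W = rng.normal(size=(out_dim, noise_dim + cond_dim)) * 0.1
fake_sample = np.tanh(W @ gen_input)     # synthetic, HEA-like data sample
```

During training, the discriminator would also receive the condition vector, so that it judges whether a sample is realistic for that specific condition rather than in general.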