Guang Chen1, Zhiqiang Shen1, Akshay Iyer2, Umar Farooq Ghumman2, Shan Tang3, Jinbo Bi4, Wei Chen2, Ying Li1,5.
Abstract
Organic molecules and polymers have a broad range of applications in biomedical, chemical, and materials science fields. Traditional design approaches for organic molecules and polymers are mainly experimentally driven, guided by experience, intuition, and conceptual insights. Although these approaches have been successfully applied to discover many important materials, they face significant challenges due to the tremendous demand for new materials and the vast design space of organic molecules and polymers. Accelerated and inverse materials design is an ideal solution to these challenges. With advancements in high-throughput computation, artificial intelligence (especially machine learning, ML), and the growth of materials databases, ML-assisted materials design is emerging as a promising tool for enabling breakthroughs in many areas of materials science and engineering. To date, ML-assisted approaches have allowed the quantitative structure-property/activity relationship used for material property prediction to be established more accurately and efficiently. In addition, ML-enabled molecular generation and inverse molecular design can accelerate materials design well beyond what was previously possible. In this perspective, we review recent progress in ML-guided design of organic molecules and polymers, highlight several successful examples, and examine future opportunities in biomedical, chemical, and materials science fields. We further discuss the challenges that must be solved to fully realize the potential of ML-assisted materials design for organic molecules and polymers. In particular, this study summarizes publicly available materials databases, feature representations for organic molecules, open-source tools for feature generation, methods for molecular generation, and ML models for prediction of material properties, serving as a tutorial for researchers who have little prior experience with ML and want to apply it to various applications.
Finally, it draws insights into the current limitations of ML-guided design of organic molecules and polymers. We anticipate that ML-assisted materials design for organic molecules and polymers will be the driving force in the near future for meeting the tremendous demand for new materials with tailored properties in different fields.
Keywords: data-driven algorithm; de novo materials design; machine learning; materials database; organic molecules; polymers
Year: 2020 PMID: 31936321 PMCID: PMC7023065 DOI: 10.3390/polym12010163
Source DB: PubMed Journal: Polymers (Basel) ISSN: 2073-4360 Impact factor: 4.329
Figure 1. Machine learning (ML)-guided design of organic photovoltaics. (a–d) The performance of ML models on desired properties: , , fill factor (FF), and power conversion efficiency (%PCE); (e) top 10% screened molecules with highest predicted (green), (blue), and (red); (f) the most promising building blocks screened by the model. The figures are adapted from Reference [70] with permission from The Royal Society of Chemistry.
Figure 2. ML-guided design of dielectric polymers. (a) Three phases involved in this design approach; (b–d) the performance of the ML model on the desired properties: electronic dielectric constant, ionic dielectric constant, and band gap; (e) the flow chart of the genetic algorithm used to identify promising candidates with desired properties; (f) the relation between the number of building blocks and the number of possible polymers, as well as the percentage of the polymers that need to be considered; (g) the optimized molecular structures with 8–12 units (C and H are not displayed explicitly). The figures are adapted from Reference [9] with permission, copyright 2016 Springer Nature.
Figure 3. Integrated design of organic light-emitting diodes (OLEDs). (a) Schematic of the integrated design method; (b) flow chart of quantum chemical computation; (c,d) the coefficient of determination for linear regression (0.80) and neural network (0.94); (e,f) the relation between hit fraction and root mean square error (RMSE) with respect to the training set size; (a–f) are adapted from Reference [83] with permission, copyright 2016 Springer Nature; (g) the best candidate molecular structures (gray, blue, and red nodes denote carbon, nitrogen, and oxygen atoms, respectively).
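For reference, the two accuracy metrics named in the Figure 3 caption are conventionally defined as follows, for $N$ samples with reference values $y_i$, model predictions $\hat{y}_i$, and mean reference value $\bar{y}$. These are the textbook definitions and are assumed here; the cited work may use a variant (for example, a squared correlation coefficient).

```latex
\[
R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}
\]
```

Under these definitions, a perfect model gives $R^2 = 1$ and $\mathrm{RMSE} = 0$, and a model that always predicts $\bar{y}$ gives $R^2 = 0$.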
Figure 4. Design framework of polymeric solar cells. (a) Data flow of four different ML models (the gray shaded box at the bottom represents a donor-acceptor structure with side groups X and Y, the number of which is variable); (b,c) performance comparison of different ML models; (d,e) distribution of molecules in the training dataset and newly suggested molecules for (shaded area stands for target property range). These figures are adapted from Reference [89] with permission, copyright 2018 AIP Publishing. DFT = density functional theory.
Figure 5. Detonation property prediction of energetic materials. (a,b) The performance of the neural network model for prediction of detonation velocity and pressure; (c–e) prediction accuracy of LASSO, Gaussian process regression (GPR), and neural network (NN), respectively; (f,g) left: learning curves of ML model for detonation energy; right: detonation pressure. (a–e) are adapted from Reference [94] with permission. (f,g) are adapted from Reference [91] with permission, copyright 2018 Springer Nature.
Figure 6. Accelerated design of polyimides (PIs) with high refractive index (RI). (a) The core structure of PI with R1 and R2 groups (blue and red nodes denote nitrogen and oxygen atoms, respectively); (b) the performance of the support vector regression (SVR) model (adapted from Reference [102] with permission, copyright 2018 AIP Publishing); (c) 29 building blocks; (d) the distribution of molecules versus the RI values for R1, R2, and PIs; (e) RI values in terms of each building block for R1 and R2; (f) Z-score of building pairs for R1 and R2; (c–f) are adapted from Reference [100] with permission, copyright 2019 American Chemical Society.
Figure 7. Integrated design for polymers with high thermal conductivity. (a) The proposed ML approach for materials discovery; (b) the performance of a direct learning algorithm; (c,d) validations of the trained linear regression model for glass transition temperature and melting temperature, respectively; (e) validation of the transfer learning; (f) the screened molecular candidates, with the synthesized ones numbered in red; (g) validation of the synthesized molecules. The figures are adapted from Reference [106] with permission, copyright 2019 Springer Nature.
Figure 8. De novo drug-like molecular design framework. (a) The proposed framework of the reinforcement learning (RL) model; (b) flow chart of the predictive model; (c–e) property distributions of the RL model versus the baseline model (no RL); (f–h) clustering of generated molecules. The figures are adapted from Reference [116] with permission, copyright 2018 AAAS.
Figure 9. A framework for designing the active layer of organic photovoltaic solar cells (OPVCs) via the spectral density function. This figure is adapted with permission from Reference [128], copyright 2018 American Society of Mechanical Engineers. SDF = spectral density function.
Summary of the nine ML-guided materials design examples.
| Materials | Design Feature | Design Scope | Data Size | Representation | ML Model |
|---|---|---|---|---|---|
| Organic photovoltaics (2011) | Self-built library and screening | Power conversion efficiency (molecular level) | 2.6M | Molecular descriptors | MLR |
| Polymer dielectrics | Self-built library; building blocks for molecular generation; genetic algorithm | Band gap and dielectric constant (molecular level) | 284 | Fingerprints | KRR |
| Organic light-emitting diodes | Self-built library and screening; building blocks for molecular generation | Delayed fluorescence rate constant (molecular level) | 40,000 | ECFPs | ANN |
| Polymer solar cell (2018) | Self-built library and screening; building blocks for molecular generation; various combinations of feature representations and ML models are compared | Highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) (molecular level) | 3938 | Fixed-length vector; string; spatial coordinates | LRR; MLP; RF; DTNN; GrammarVAE |
| High-energetic material | Material design with limited data; various combinations of feature representations and ML models are compared | High energy density and low sensitivity (molecular level) | 109; 309 | CDS; SoB; CM; BoB; fingerprints | KRR; RR; SVR; RF; kNN; LASSO; GPR; ANN |
| Polyimides with high refractive index | Self-built library and screening; building blocks for molecular generation; ML model construction with limited data | Polarizability and number density (molecular level) | 196 | Number of monomer units | SVM |
| Polymer with high thermal conductivity | ML model construction with limited data; transfer learning | Thermal conductivity (molecular level) | 28; 5917; 3234 | ECFPs | Bayesian model |
| De novo drug-like molecule | Material design with arbitrary target property range; SMILES strings as input for molecular generation | Physical/chemical/biological properties (molecular level) | 1.5M | SMILES | DNN; RL |
| Organic photovoltaic solar cells (2019) | Polymer composite design; bottom-up nanofabrication; microstructure characterization and reconstruction | IPCE efficiency (microstructure level) | 45 | Microstructure characterization | SDF |
Note: ECFPs: extended-connectivity fingerprints; CDS: custom descriptor set; SoB: sum over bonds; CM: Coulomb matrix; BoB: bag of bonds; SMILES: simplified molecular-input line-entry system. MLR: multi-linear regression; KRR: kernel ridge regression; ANN: artificial neural network; LRR: linear ridge regression; MLP: multi-layer perceptron; RF: random forest; DTNN: deep tensor neural network; GrammarVAE: grammar variational autoencoder; RR: ridge regression; SVR: support vector regression; kNN: k-nearest neighbors; GPR: Gaussian process regression; SVM: support vector machine; DNN: deep neural network; RL: reinforcement learning; SDF: spectral density function.
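To make one of the models named in the table concrete, the following is a minimal sketch of kernel ridge regression (KRR) in plain Python, in its usual closed form: the weights solve (K + λI)α = y and a prediction is a kernel-weighted sum over the training points. It is not the implementation used in any of the cited studies. The Gaussian kernel, the hyperparameters `gamma` and `lam`, and the tiny binary "fingerprint" vectors with their target values are all assumptions made up for illustration.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def krr_fit(features, targets, gamma=0.5, lam=1e-6):
    """Return the weights alpha that solve (K + lam * I) alpha = targets."""
    n = len(features)
    kernel = [[gaussian_kernel(features[i], features[j], gamma) for j in range(n)]
              for i in range(n)]
    for i in range(n):
        kernel[i][i] += lam  # ridge regularization on the diagonal
    return solve_linear_system(kernel, targets)


def krr_predict(train_features, alpha, x, gamma=0.5):
    """Predict the target for x as a kernel-weighted sum over training points."""
    return sum(a * gaussian_kernel(t, x, gamma)
               for a, t in zip(alpha, train_features))


# Made-up toy data: four binary fingerprint-like vectors and arbitrary targets.
toy_features = [[1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
toy_targets = [1.0, 2.0, 1.5, 3.0]

toy_alpha = krr_fit(toy_features, toy_targets)
print(krr_predict(toy_features, toy_alpha, [1, 0, 1, 1]))
```

With a very small `lam` the model nearly interpolates the training targets; in practice `gamma` and `lam` would be chosen by cross-validation, which matters most for the small datasets listed in the table.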
Some public materials databases containing structures and properties.
| Database | Type | Description | URL |
|---|---|---|---|
| AFLOWLIB | Computation | Database of 2,961,744 material compounds with over 527,190,432 calculated properties | |
| BNPAH | Computation | Structures and properties of 77 polycyclic aromatic hydrocarbons and 33,059 B, N substituted compounds | |
| ChemDiv | Comp./Exp. | Collection of over 1,500,000 individually crafted, lead-like, drug-like small molecules | |
| ChemSpider | Experiment | A free chemical structure database providing fast text and structure search access to over 67 million structures | |
| ChEMBL | Experiment | A manually curated database of bioactive molecules with drug-like properties | |
| Citrination | Experiment | A premier open database and analytics platform for the world’s material and chemical information | |
| CMR | Computation | A collection of molecules obtained from electronic-structure codes | |
| COD | Experiment | A collection of crystal structures of organic, inorganic, and metal-organic compounds and minerals, excluding biopolymers | |
| CSD | Experiment | A database of over one million small-molecule organic and metal-organic crystal structures | |
| DrugBank | Experiment | Drug database with comprehensive drug target information | |
| eMolecules | N/A | Commercially available collection of over seven million compounds for drug discovery | |
| Energetics | Computation | A database of energetic molecules | |
| GDB | Computation | A database containing hypothetical small organic molecules | |
| HCEP | Computation | Harvard Clean Energy Project for solar absorber materials | |
| HOPV15 | Comp./Exp. | A collation of experimental photovoltaic data from the literature, calibrated by DFT calculations | |
| ICSD | Experiment | A database of inorganic crystal structures | |
| MatNavi | Experiment | A materials database of polymers, ceramics, alloys, superconducting materials, composites, and diffusion | |
| MatWeb | Experiment | A database of material properties of polymers, metals, ceramics, and semiconductors | |
| MP | Computation | Computed information on known and predicted materials | |
| NIST CW | Experiment | A database of thermochemical properties | |
| NIST MDR | Experiment | A repository of material data being updated | |
| NOMAD | Computation | A repository to host, organize, and share material data | |
| NREL MD | Computation | A computational materials database for renewable energy applications | |
| OQMD | Computation | A database of DFT-calculated thermodynamic and structural properties | |
| PubChem | Experiment | A chemical database of chemical and physical properties, biological activities, and safety and toxicity information | |
| QM | Computation | Small organic molecules calculated by DFT | |
| TEDesignLab | Comp./Exp. | Thermoelectric material design | |
| ZINC | Computation | Database of commercially available compounds for virtual screening | |
Common feature representations of organic molecules and tools for feature generation.
| Feature representation | Description |
|---|---|
| SMILES | Line notation for describing a chemical structure using text strings |
| Fingerprints | A special descriptor using a vector of fixed or variable length to represent a chemical structure |
| Molecular graphs | A representation of chemical structures by graph theory |
| Coulomb matrix | A matrix representation encoding nuclear coordinates and charges; similar representations include the Ewald sum matrix and the sine matrix |
| Smooth overlap of atomic positions (SOAP) | A special descriptor encoding atomic structures using a local expansion of the atomic density |
| Atom-centered symmetry functions (ACSF) | A special descriptor representing the local environment near an atom using two- or three-body functions |
| Bag of bonds | A vector containing chemical bonds and their corresponding numbers |
| Grids of molecules | A visual form of molecules generated from their coordinates |

| Tool | Description |
|---|---|
| CDK | Chemistry Development Kit: open-source Java libraries for cheminformatics to generate various descriptors, fingerprints, etc. |
| ChemDes | A free web-based tool for generation of molecular descriptors (3679 types) and fingerprints (59 types) |
| ChemMine | A free online tool for analyzing and clustering small molecules, including similarity search and property calculations |
| OEChem | Programming library for chemistry and cheminformatics with small molecules |
| Open Babel/Pybel | Open-source chemical toolbox to search, convert, analyze, and store data |
| PaDEL | Software to generate molecular descriptors (1875 types) and fingerprints (12 types) using CDK |
| PubChemPy | An open-source Python library to interact with PubChem |
| RDKit | A collection of cheminformatics and machine-learning tools |
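As an illustration of one representation listed above, the following minimal sketch builds a Coulomb matrix in plain Python. It assumes the commonly used definition, with diagonal entries 0.5·Z_i^2.4 and off-diagonal entries Z_i·Z_j/|R_i − R_j|, and it is not taken from any of the listed tools. The optional sorting by row norm is one common way to reduce the dependence on atom ordering. The two-atom geometry is a made-up example, with coordinates assumed to be in bohr.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix for nuclear charges Z_i at positions R_i.

    Assumed definition: M_ii = 0.5 * Z_i**2.4 and
    M_ij = Z_i * Z_j / |R_i - R_j| for i != j.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def sort_by_row_norm(matrix):
    """Reorder rows and columns by descending row norm, so that the result
    does not depend on the order in which the atoms were listed."""
    n = len(matrix)
    norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    order = sorted(range(n), key=lambda i: -norms[i])
    return [[matrix[i][j] for j in order] for i in order]


# Hypothetical two-atom example: an O-H pair placed 2.0 bohr apart.
example_charges = [8, 1]
example_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
example_matrix = coulomb_matrix(example_charges, example_coords)
for row in sort_by_row_norm(example_matrix):
    print(row)
```

In practice, matrices for molecules of different sizes are usually zero-padded to a common dimension before being passed to an ML model.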
Figure 10. Effect of the choice of ML models and descriptors on the performance of ML predictions of methane uptake in metal-organic frameworks (MOFs). Reprinted with permission from Reference [191], copyright 2017 American Chemical Society.
Figure 11. Typical molecular generation methods. (a) GDB molecular database generated by direct enumeration (adapted from Reference [23] with permission, copyright 2012 American Chemical Society); (b) high-energetic molecules generated by a material genome approach (adapted from Reference [52] with permission, copyright 2018 Springer Nature); (c) molecular generation by CNNs or RNNs with SMILES representation (adapted from Reference [60] with permission, copyright 2019 Royal Society of Chemistry); (d) molecular generation by generative adversarial networks (GANs) using SMILES representation (adapted from Reference [25] with permission).
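The simplest of the generation strategies in Figure 11, direct enumeration, can be sketched for the building-block setting as follows. This is a toy illustration, not the GDB procedure, which enumerates molecular graphs under chemical rules and filters. The block labels `A`, `B`, `C` are hypothetical, and the only equivalence assumed is that a linear chain and its reverse describe the same structure.

```python
from itertools import product


def enumerate_chains(blocks, length):
    """Enumerate linear chains of building blocks of a given length.

    A chain and its reverse are treated as the same structure (an
    assumption of this sketch), so only one of the two is kept.
    """
    seen = set()
    chains = []
    for sequence in product(blocks, repeat=length):
        # Use the lexicographically smaller direction as the canonical form.
        canonical = min(sequence, sequence[::-1])
        if canonical not in seen:
            seen.add(canonical)
            chains.append(canonical)
    return chains


# Hypothetical building-block labels; real blocks would be chemical fragments.
example_blocks = ["A", "B", "C"]
example_chains = enumerate_chains(example_blocks, 4)
print(len(example_chains))
```

For k blocks and chain length n, this count is (k^n + k^ceil(n/2))/2, which gives 45 for the example above. The exponential growth with n is why exhaustive enumeration is typically combined with screening or search methods such as a genetic algorithm.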
Figure 12. Illustration of materials design approaches. (a) ML-assisted materials screening (adapted from Reference [205] with permission, copyright 2018 Springer Nature); (b) high-throughput virtual screening integrated with ML models (adapted from Reference [206] with permission, copyright 2019 Elsevier); (c) inverse molecular design by RL; (d) integration of various modules for design of insulating nanocomposites by Bayesian optimization (BO). ECFP = extended connectivity fingerprints.