Jiazhen Cai1, Xuan Chu1, Kun Xu1, Hongbo Li2, Jing Wei1,2.
Abstract
New materials can bring about tremendous progress in technology and applications. However, the commonly used trial-and-error method cannot meet the current need for new materials. Now, a newly proposed idea of using machine learning to explore new materials is becoming popular. In this paper, we review this research paradigm of applying machine learning in material discovery, including data preprocessing, feature engineering, machine learning algorithms and cross-validation procedures. Furthermore, we propose to assist traditional DFT calculations with machine learning for material discovery. Many experiments and literature reports have shown the great effects and prospects of this idea. It is currently showing its potential and advantages in property prediction, material discovery, inverse design, corrosion detection and many other aspects of life.
This journal is © The Royal Society of Chemistry.
Year: 2020 PMID: 36134280 PMCID: PMC9419423 DOI: 10.1039/d0na00388c
Source DB: PubMed Journal: Nanoscale Adv ISSN: 2516-0230
Fig. 1 The machine learning workflow; the place of feature engineering is shown in the red circle.[11]
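The steps named in the abstract (data preprocessing, a learning algorithm and cross-validation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the single synthetic descriptor, the linear model and the helper names `standardize`, `fit_line` and `k_fold_rmse` do not come from the paper, and a real study would use measured or computed materials data and a richer feature set.

```python
import random
from statistics import mean, pstdev


def standardize(train, test):
    """Scale values to zero mean and unit variance using training statistics only."""
    mu, sigma = mean(train), pstdev(train)
    sigma = sigma or 1.0  # guard against a constant feature
    return [(v - mu) / sigma for v in train], [(v - mu) / sigma for v in test]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def k_fold_rmse(x, y, k=5, seed=0):
    """Return the root-mean-square error on each of the k held-out folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Preprocessing is fitted on the training part only to avoid leakage.
        x_tr, x_te = standardize([x[i] for i in train], [x[i] for i in held_out])
        slope, intercept = fit_line(x_tr, [y[i] for i in train])
        errors = [(slope * xv + intercept - y[i]) ** 2 for xv, i in zip(x_te, held_out)]
        scores.append(mean(errors) ** 0.5)
    return scores


# Synthetic stand-in for a descriptor/property table (not real materials data).
rng = random.Random(42)
descriptor = [rng.uniform(0.0, 10.0) for _ in range(50)]
target = [2.0 * d + 1.0 + rng.gauss(0.0, 0.1) for d in descriptor]
fold_rmse = k_fold_rmse(descriptor, target, k=5)
print([round(s, 3) for s in fold_rmse])
```

Reporting the error per fold, rather than a single training error, is what lets overfitting show up as a gap between training and held-out performance.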
Fig. 2 An overview of the application of machine learning in materials science.[12]
An overview of some of the most authoritative databases in material science
| Title | Website address | Brief introduction |
|---|---|---|
| AFLOWLIB | | A global database of 3 249 264 material compounds and over 588 116 784 calculated properties |
| ASM Alloy Database | | An authoritative database focusing on alloys, mechanical and alloy phases, and failed experiment data |
| Cambridge Crystallographic Data Centre | | It focuses on structural chemistry and contains over 1 000 000 structures |
| ChemSpider | | A free chemical structural database providing fast searching access to over 67 000 000 structures |
| Harvard Clean Energy Project | | A massive database of organic solar cell materials |
| High Performance Alloys Database | | This high performance alloy database addresses the needs of the chemical processing, power generation and transportation industries |
| Materials Project | | It offers more than 530 000 nanoporous materials, 124 000 inorganic compounds and power analysis tools for researchers |
| NanoHUB | | An open source database focusing on nanomaterials |
| Open Quantum Materials Database | | It contains substantial amounts of data on the thermodynamic and structural properties of 637 644 materials |
| Springer Materials | | A comprehensive database covering multiple material classes, properties and applications |
Fig. 3 Evolution of the workflow of machine learning in nanomaterials discovery and design. (a) First-generation approach. In this paradigm, there are two main steps: feature engineering, from the raw database to descriptors; and model building, from descriptors to the target model. (b) Second-generation approach. The key characteristic that distinguishes it from the first-generation approach is that human-expert feature engineering is eliminated; the features are instead learned directly from the raw nanomaterials data.[29]
Fig. 4 A recipe for proceeding from data to fingerprinting descriptors to insights to models and discovery.[35]
Fig. 5 (a) Performance of SVM-based classifiers in distinguishing between PVPs and non-PVPs. The red arrow denotes the final selected model. (b) Classification accuracy of the final selected model. (c) Distribution of each feature type in the optimal feature set (136 features) and original feature set (583 features).[59]
Fig. 6 (a) Schematic of a single decision tree.[32] (b) Taylor diagram of different models for UCS prediction.[69]
Fig. 7 (a) An example of a feed-forward ANN with N hidden layers and a single neuron in the output layer.[80] (b) Schematic of an ANN-evaluated genetic algorithm.[74]
Fig. 8 Schematic overview of data annotation and the deep learning pipeline in neurodegenerative disease diagnosis.[84]
An overview of some basic machine learning algorithms
| Algorithm | Brief introduction | Advantages | Disadvantages | Representative applications |
|---|---|---|---|---|
| Regression analysis | It fits a regression equation and uses it to predict dependent variables | Well developed and widely used | Needs large amounts of data and may overfit in practical applications | Machine learning with systematic density-functional theory calculations: application to melting temperatures of single- and binary-component solids |
| Naïve Bayes classifier | It assigns each sample to the category with the highest probability | Only a small amount of data is needed to estimate the essential parameters | The feature independence assumption does not always hold | A naïve-Bayes classifier for damage detection in engineering materials |
| Support vector machine | SVM finds a hyperplane that divides a group of points into two categories | It generalizes well and handles high-dimensional datasets properly | SVM is not well suited to multi-class classification problems | PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine |
| Decision tree and random forest | The source dataset is split into successive subsets until all data are judged and classified | The calculation process is easy to comprehend, and large amounts of data can be handled | It is difficult to obtain a high-performance decision tree or random forest, and overfitting may occur | High-throughput machine-learning-driven synthesis of full-Heusler compounds |
| Artificial neural network | By imitating neuron activity, an ANN can automatically find underlying patterns in its inputs | ANNs have strong self-improving ability, robustness and fault tolerance | Their internal calculation processes are very difficult to understand | Learning from the Harvard Clean Energy Project: the use of neural networks to accelerate materials discovery |
| Deep learning | Originating from ANNs, it builds a neural network that analyzes data by imitating the human brain | It has the best self-adjusting and self-improving abilities compared with other ML methods | As a new trend in ML, deep learning has not yet been well studied, and many of its defects are still unclear | Artificial intelligence in neuropathology: deep learning-based assessment of tauopathy |
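The naïve Bayes row of the table can be made concrete with a small sketch. It assumes Gaussian-distributed features, which is one common choice and not something the table specifies; the two toy features, the labels `intact` and `damaged`, and the helper names are hypothetical. The sum over per-feature log-likelihoods is exactly where the feature independence assumption enters.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(samples, labels):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(samples),
            "means": [mean(c) for c in columns],
            # A small floor keeps every variance strictly positive.
            "vars": [pvariance(c) + 1e-9 for c in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log-density of a one-dimensional normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)


def predict(model, features):
    """Return the class with the highest (unnormalized) log-posterior."""
    def log_posterior(label):
        p = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        return math.log(p["prior"]) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(features, p["means"], p["vars"])
        )
    return max(model, key=log_posterior)


# Hypothetical two-feature toy data, not taken from any cited study.
samples = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 0.5), (3.2, 0.7), (2.9, 0.4)]
labels = ["intact", "intact", "intact", "damaged", "damaged", "damaged"]
nb_model = fit_gaussian_nb(samples, labels)
print(predict(nb_model, (1.1, 2.0)), predict(nb_model, (3.1, 0.6)))
```

Only a prior, a mean and a variance per class and feature are estimated, which is why the method can work with a small amount of data.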
Fig. 9 Results and insights from the ML model. (a) The fitting results of the test bandgaps $E_g^{\mathrm{PBE}}$ and predicted bandgaps $E_g^{\mathrm{ML}}$. (b) Scatter plots of tolerance factors against the bandgaps for the prediction dataset from the trained ML model (the blue, red and dark gray plots represent the training, testing and prediction sets, respectively). Data visualization of predicted bandgaps for all possible HOIPs (each color represents a class of halogen perovskites) with the (c) tolerance factor, (d) octahedral factor, (e) ionic polarizability for the A-site ions, and (f) electronegativity of the B-site ions. The dotted boxes represent the most appropriate range for each feature.[28]
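Two of the descriptors named in the Fig. 9 caption, the tolerance factor and the octahedral factor, are simple functions of ionic radii. The sketch below assumes the standard Goldschmidt definitions for an ABX3 perovskite, t = (r_A + r_X) / (√2 (r_B + r_X)) and μ = r_B / r_X; the caption itself does not state the formulas, and the radii used are illustrative inputs in ångström rather than values from the source.

```python
import math


def tolerance_factor(r_a, r_b, r_x):
    """Goldschmidt tolerance factor t for an ABX3 perovskite (assumed definition)."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))


def octahedral_factor(r_b, r_x):
    """Octahedral factor mu = r_B / r_X (assumed definition)."""
    return r_b / r_x


# Illustrative radii in angstrom; placeholders, not data from the paper.
r_a, r_b, r_x = 2.17, 1.19, 2.20
t = tolerance_factor(r_a, r_b, r_x)
mu = octahedral_factor(r_b, r_x)
print(f"t = {t:.3f}, mu = {mu:.3f}")
```

Descriptors of this kind are cheap to compute for every candidate composition, which is what makes them usable as inputs when screening a large prediction set.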
Fig. 10 (a) Schematic of different approaches toward molecular design. Inverse design starts from desired properties and ends in chemical space, unlike the direct approach, which leads from chemical space to the desired properties.[121] (b) Computer vision analysis of pictures to detect rail defects.[123] (c) Computational high-throughput screening for $|\Delta G_{\mathrm{H}}|$ on 256 pure metals and surface alloys. The rows indicate the identities of the pure metal substrates, the columns indicate the identities of the solutes embedded in the surface layers of the substrates, and the diagonal of the plot corresponds to the hydrogen adsorption free energy of the pure metals.[131]