Hironao Yamada, Chang Liu, Stephen Wu, Yukinori Koyama, Shenghong Ju, Junichiro Shiomi, Junko Morikawa, Ryo Yoshida.
Abstract
There is a growing demand for the use of machine learning (ML) to derive fast-to-evaluate surrogate models of materials properties. In recent years, a broad array of materials property databases has emerged as part of a digital transformation of materials science. However, recent technological advances in ML are not fully exploited because of the insufficient volume and diversity of materials data. An ML framework called "transfer learning" has considerable potential to overcome the problem of limited amounts of materials data. Transfer learning relies on the concept that various property types, such as physical, chemical, electronic, thermodynamic, and mechanical properties, are physically interrelated. For a given target property to be predicted from a limited supply of training data, models of related proxy properties are pretrained using sufficient data; these models capture common features relevant to the target task. Repurposing such machine-acquired features on the target task yields outstanding prediction performance even with exceedingly small data sets, much as highly experienced human experts can make rational inferences even on tasks in which they have considerably less experience. In this study, to facilitate widespread use of transfer learning, we develop a pretrained model library called XenonPy.MDL. In this first release, the library comprises more than 140 000 pretrained models for various properties of small molecules, polymers, and inorganic crystalline materials. Along with these pretrained models, we describe some outstanding successes of transfer learning in different scenarios, such as building models from only dozens of materials data and increasing the ability of extrapolative prediction through a strategic model transfer. Remarkably, transfer learning has autonomously identified rather nontrivial transferability across different properties, transcending the different disciplines of materials science; for example, our analysis has revealed underlying bridges between small molecules and polymers and between organic and inorganic chemistry.
Year: 2019 PMID: 31660440 PMCID: PMC6813555 DOI: 10.1021/acscentsci.9b00804
Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553
Figure 1. Neural transfer learning with frozen featurizers. In this example, a fully connected pyramid neural network is first trained using training instances for the monomeric C_V. The subnetwork excluding the output layer is then used as a feature extractor and repurposed in a model of the polymeric C_P.
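The frozen-featurizer idea in Figure 1 can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration: the tiny layer sizes, the random weights standing in for a pretrained source network, the synthetic target data, and the linear head fitted by gradient descent (the study's transfer setups feed the hidden-layer features to a random forest instead).

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out


def dense(layer, x):
    """Apply one dense layer followed by a tanh activation."""
    w, b = layer
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]


def featurize(hidden_layers, x):
    """Frozen feature extractor: pass x through the hidden layers only."""
    for layer in hidden_layers:
        x = dense(layer, x)
    return x


def predict(f, w, b):
    return sum(wi * fi for wi, fi in zip(w, f)) + b


def mse(features, targets, w, b):
    return sum((predict(f, w, b) - y) ** 2
               for f, y in zip(features, targets)) / len(targets)


def fit_linear_head(features, targets, lr=0.05, epochs=500):
    """Fit a new linear output on fixed features by batch gradient descent."""
    n, p = len(features), len(features[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for f, y in zip(features, targets):
            err = predict(f, w, b) - y
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * f[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


rng = random.Random(0)
# Stand-in for a pretrained source network (hypothetical sizes): 8 -> 6 -> 4 -> 1.
source_net = [make_layer(8, 6, rng), make_layer(6, 4, rng), make_layer(4, 1, rng)]
hidden_layers = source_net[:-1]  # drop the source output layer

# Small synthetic target data set, a placeholder for the scarce target property.
X_target = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(20)]
y_target = [sum(x[:3]) for x in X_target]

F_target = [featurize(hidden_layers, x) for x in X_target]
mse_before = mse(F_target, y_target, [0.0] * 4, 0.0)
w_head, b_head = fit_linear_head(F_target, y_target)
mse_after = mse(F_target, y_target, w_head, b_head)
```

Only `w_head` and `b_head` are learned on the target data; the hidden layers are never updated, which is what keeps the number of fitted parameters small when target data are scarce.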
Summary of Models Trained in This Study^a
| material type | database | property | model type | model parameters | no. of models | best model correlation | no. of descriptors | descriptor type |
|---|---|---|---|---|---|---|---|---|
| organic | PoLyInfo (polymer) | glass transition temperature | RF-R | RF setup 1 | 1,000 | 0.950 | max 500* | rcdk-all |
| organic | PoLyInfo (polymer) | glass transition temperature | GB-R | GB setup | 1,000 | 0.950 | max 500* | rcdk-all |
| organic | PoLyInfo (polymer) | glass transition temperature | EN-R | EN setup | 1,000 | 0.920 | max 500* | rcdk-all |
| organic | PoLyInfo (polymer) | glass transition temperature | NN-R | NN setup 1 | 1,000 | 0.950 | max 400–600# | rcdk-all |
| organic | PoLyInfo (polymer) | glass transition temperature | NN-Py | NN setup 2 | 500 | 0.955 | 2,048 | RDKit-5 |
| organic | PoLyInfo (polymer) | density | NN-R | NN setup 1 | 1,000 | 0.910 | max 400–600# | rcdk-all |
| organic | PoLyInfo (polymer) | density | NN-Py | NN setup 2 | 500 | 0.859 | 2,048 | RDKit-5 |
| organic | PoLyInfo (polymer) | viscosity | NN-R | NN setup 1 | 1,000 | 0.890 | max 400–600# | rcdk-all |
| organic | PoLyInfo (polymer) | viscosity | NN-Py | NN setup 2 | 500 | 0.613 | 2,048 | RDKit-5 |
| organic | PoLyInfo (polymer) | melting temperature | NN-R | NN setup 1 | 1,000 | 0.880 | max 400–600# | rcdk-all |
| organic | PoLyInfo (polymer) | melting temperature | NN-Py | NN setup 2 | 500 | 0.885 | 2,048 | RDKit-5 |
| organic | PoLyInfo (polymer) | heat capacity at constant pressure | NN-R | TL setup 1 | 25,000 | 0.992 | max 400–600# | rcdk-all |
| organic | PoLyInfo (polymer) | thermal conductivity | NN-R | TL setup 1 | 25,000 | 1.000 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | heat capacity at constant volume | NN-R | NN setup 1 | ∼500 | 0.900 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | LUMO | NN-R | NN setup 1 | ∼500 | 0.950 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | HOMO–LUMO gap | NN-R | NN setup 1 | ∼500 | 0.940 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | zero point vibrational energy | NN-R | NN setup 1 | ∼500 | 0.940 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | internal energy at 0 K | NN-R | NN setup 1 | ∼500 | 0.920 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | enthalpy at 298.15 K | NN-R | NN setup 1 | ∼500 | 0.910 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | free energy at 298.15 K | NN-R | NN setup 1 | ∼500 | 0.910 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | HOMO | NN-R | NN setup 1 | ∼500 | 0.880 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | internal energy at 298.15 K | NN-R | NN setup 1 | ∼500 | 0.880 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | isotropic polarizability | NN-R | NN setup 1 | ∼500 | 0.870 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | electronic spatial extent | NN-R | NN setup 1 | ∼500 | 0.800 | max 400–600# | rcdk-all |
| organic | QM9 (small molecule) | dipole moment | NN-R | NN setup 1 | ∼500 | 0.740 | max 400–600# | rcdk-all |
^a RF-R, GB-R, EN-R, and NN-R denote models obtained from the ranger package (random forest), xgboost package (gradient boosting), glmnet package (elastic net), and MXNet package (neural network) in R, respectively. NN-Py and RF-Py denote neural networks trained with PyTorch and random forests trained with scikit-learn in Python, respectively. CGCNN-Py denotes the crystal graph convolutional neural network in PyTorch.

The hyperparameters of each model were randomly selected from fixed ranges. RF setup 1 denotes the number of trees (nTree) ∈ [100, 800] and the number of randomly chosen features (mTry) ∈ [20, 100]. RF setup 2 denotes nTree ∈ [50, 500] and mTry ∈ [50, 500]. GB setup denotes the learning rate (eta) ∈ [0.1, 1], the maximum tree depth (max_depth) ∈ [3, 10], and the maximum number of boosting iterations (nrounds) ∈ [50, 200]. EN setup denotes the elastic net mixing parameter (alpha) ∈ [0, 1] with the Gaussian-response-type family and randomly selected λ. NN setup 1 denotes the number of epochs ∈ [3000, 4000] and the number of hidden layers ∈ [3, 4], with the maximum number of nodes in the first hidden layer equal to 400 and the number of nodes in the last layer ∈ [10, 30]. NN setup 2 is the same as NN setup 1 except that the maximum number of nodes in the first hidden layer is 1640. NN setup 3 denotes the number of epochs ∈ [1000, 3000] and the number of hidden layers ∈ [3, 6], with the maximum number of nodes in the first hidden layer given by 348 and the minimum number of nodes in the last layer given by 5.

TL setup 1 denotes the use of the last hidden layer of a source neural network (N nodes) as an input for RF-R with randomly picked hyperparameters: nTree ∈ [half of the number of training samples, the number of training samples] and mTry ∈ [N/2, N]. TL setup 2 denotes the use of a randomly chosen subset of all the hidden layers of the SPS best model as an input for RF-Py. Randomly selected hyperparameters were employed: nTree = 200, the maximum number of features = square root of the number of descriptors.

For descriptor types, rcdk-all denotes the combination of all available fingerprints in rcdk (standard, extended, graph, hybridization, maccs, estate, pubchem, kr, circular); RDKit-5 denotes atom pairs and topological torsions fingerprints, Morgan fingerprints (with and without feature-based invariants), and basic fingerprints in RDKit; XenonPy denotes compositional and RDF descriptors in XenonPy.

The symbol * denotes cases in which fingerprint entries that were zero in more than 90% of the training instances were first removed from a total of 11 106 bits, and the remaining entries were then randomly discarded until at most 500 remained. The symbol # denotes cases identical to those of *, except that the entries remaining after the filtering were randomly reduced to at most X entries, where X is randomly picked from the given range. Furthermore, % indicates that the 3600 models consist of three sets of 1200 models that correspond to the compositional and RDF descriptors for stable structures and the compositional descriptor for unstable structures, respectively.
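The random hyperparameter draws and the two-step fingerprint filter (symbols * and #) described in this note can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, uniform sampling within each range is an assumption (the note only says "randomly selected"), and the toy matrix and caps are scaled down from the 11 106 bits and 400 to 600 entries quoted above.

```python
import random


def sample_rf_setup_1(rng):
    """RF setup 1 ranges from the note; uniform integer draws are assumed."""
    return {"nTree": rng.randint(100, 800), "mTry": rng.randint(20, 100)}


def filter_fingerprint_bits(X, rng, max_bits=500, zero_fraction=0.9):
    """Return the sorted column indices kept by the two-step filter.

    Step 1: drop columns that are zero in more than `zero_fraction` of rows.
    Step 2: if more than `max_bits` columns survive, keep a random subset.
    """
    n_rows, n_cols = len(X), len(X[0])
    kept = []
    for j in range(n_cols):
        zeros = sum(1 for row in X if row[j] == 0)
        if zeros <= zero_fraction * n_rows:
            kept.append(j)
    if len(kept) > max_bits:
        kept = rng.sample(kept, max_bits)
    return sorted(kept)


rng = random.Random(0)
# Toy binary fingerprints: 50 samples x 40 bits; bits 0-9 are always zero.
X = [[0] * 10 + [rng.randint(0, 1) for _ in range(30)] for _ in range(50)]

# "*" rule: fixed cap (20 here instead of 500).
cols_star = filter_fingerprint_bits(X, rng, max_bits=20)

# "#" rule: the cap itself is drawn from a range (10-25 here instead of 400-600).
cap = rng.randint(10, 25)
cols_hash = filter_fingerprint_bits(X, rng, max_bits=cap)

params = sample_rf_setup_1(rng)
```

Because both the cap and the retained subset are random, repeated calls yield models that see different descriptor subsets, which is one source of the diversity among the pretrained models.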
Figure 2. Illustrative example of transfer learning for prediction of polymeric C_P. (a) The left two panels show the prediction performance of a directly supervised random forest and of the best transfer learning model, using 58 instances of the polymeric C_P under 5-fold CV. The predicted and experimentally measured properties are shown on the horizontal and vertical axes, respectively; predictions are colored in shades of red and fits to the training data in the CV in blue. The best transfer learning model is obtained from 1000 pretrained source models for the C_V of small molecules, each with a different, randomly generated network. The transferred polymeric C_P model exhibiting the minimum MAE value was identified through the same 5-fold CV. The right panel plots the MAE values of the 1000 pretrained models on the source task (the monomeric C_V) against those of their transferred models on the target task (the polymeric C_P). (b) Same layout as (a), except that the models were trained with a stratified group 6-fold CV: all the polymers were divided into six nonoverlapping subgroups according to their compositional and structural features, and the CV was looped over this grouping. (c) Heatmap display of neural descriptors acquired from C_V and repurposed on C_P. For each layer in the C_V network, we calculated the n × p descriptor matrices with the chemical structures given in the C_P data set, where p is the number of neurons and n is the number of samples on C_P. In all the heatmaps, the n samples are sorted from top to bottom in increasing order of C_P.
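The selection step in Figure 2, which picks the transferred model with the smallest cross-validated MAE on the target task, can be sketched in plain Python. The simplifications are assumptions made for brevity: each candidate pretrained model is reduced to a single scalar feature, the target model is a one-variable least-squares line rather than a random forest, and the data are synthetic. `group_folds` shows how the grouped CV of panel (b) differs from ordinary k-fold splitting; all names are hypothetical.

```python
def kfold(n, k):
    """Split indices 0..n-1 into k interleaved test folds."""
    return [list(range(i, n, k)) for i in range(k)]


def group_folds(groups):
    """One test fold per group label, so no group spans train and test."""
    labels = sorted(set(groups))
    return [[i for i, g in enumerate(groups) if g == lab] for lab in labels]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var if var > 0 else 0.0
    return slope, my - slope * mx


def cv_mae(feature, ys, folds):
    """Mean absolute error on held-out samples, pooled across folds."""
    abs_errors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(ys)) if i not in held_out]
        slope, intercept = fit_line([feature[i] for i in train],
                                    [ys[i] for i in train])
        abs_errors += [abs(slope * feature[i] + intercept - ys[i]) for i in test]
    return sum(abs_errors) / len(abs_errors)


# Synthetic target data: y depends only on the feature supplied by model_a.
xs = [0.1 * i for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]
candidates = {
    "model_a": xs,                                        # informative feature
    "model_b": [((7 * i) % 11) / 10 for i in range(30)],  # unrelated feature
}

folds = kfold(len(ys), 5)
scores = {name: cv_mae(feat, ys, folds) for name, feat in candidates.items()}
best = min(scores, key=scores.get)
```

Passing `group_folds(labels)` instead of `kfold(...)` to `cv_mae` evaluates each candidate on groups it never saw during fitting, which is the stricter test used in panel (b).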
Figure 3. Transfer learning (TL) for λ of polymers using 19 observations. (a) The upper left plot shows the 19 observed values against the values predicted by directly trained random forests. The other panels present the prediction performance of transferred random forests trained using neural network features acquired from pretraining on C_V (small molecules) and on the viscosity, ρ, T_g, and T_m of polymers. The predicted and fitted values in the 5-fold CV are colored orange and blue, respectively. (b) Scatter plot matrix of observed properties in PoLyInfo for T_g (°C), T_m (°C), ρ (g/cm³), viscosity (η, dL/g) in log scale, C_P (cal/g °C), and λ (W/mK). (c) Prediction performance of transferred random forests trained using neural network features acquired from pretraining on the ionization energy (E_ion), n, cohesive energy (E_coh), Hildebrand solubility parameter (δ), and electronic dielectric constant (ϵ_e) in Polymer Genome.
Figure 4. Extrapolation ability of transferred models for predicting λ. (a) Prediction of λ for three polymers that were newly synthesized in our previous study [47] (left: directly learned random forest; right: the best transferred model). (b, c) Chemical structures of the three new polymers and the 19 training polymers used in the transfer learning.
Figure 5. Transfer learning for LTC of inorganic compounds. (a) Scatter plot of data on SPS and LTC. (b) Prediction performance of the model exhibiting the best transferability among 1000 pretrained models. The validation and training results in the 10-fold CV are colored orange and blue, respectively. (c) Histogram showing LTC distributions for the 45 training samples and for 14 crystals with ultrahigh LTC identified by HTS. In the prediction-observation plot in the inset, the orange dots and blue diamonds denote the predicted values of the transferred model and of a neural network directly trained using the 45 samples, respectively, demonstrating the extrapolative prediction performance.
Figure 6. Transfer learning across organic and inorganic materials. (a) Heatmap display of 290 compositional descriptors for 853 organic polymers (upper half) and 1056 inorganic compounds (lower half). The samples in the upper and lower halves are sorted separately, from top to bottom, in increasing order of n for the organic polymers and the inorganic compounds, respectively. (b) Projection of the 290 compositional descriptors onto a two-dimensional space through t-SNE. The organic polymer and inorganic compound samples are colored red and blue, respectively. (c) Transfer learning performance from inorganic compounds to organic polymers. (Left) Prediction performance for n of organic polymers using a model trained on inorganic compound data. (Right) Prediction results of the best transferred model.