Ashley N Henderson, Steven K Kauwe, Taylor D Sparks.
Abstract
Materials discovery via machine learning has become increasingly popular because it can predict materials properties rapidly and at low cost. However, the field lacks benchmark datasets, particularly ones that cover the range of sizes, tasks, material systems, and data modalities found in the materials informatics literature. This makes it difficult to identify the best machine learning choices for a given task, including algorithm, model architecture, data splitting, and data featurization. Here, we attempt to address this gap by assembling a repository of 50 datasets for materials properties. The repository contains both experimental and computational data, data suited to regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. Data were extracted from 16 publications. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either 5-fold or 10-fold cross-validation, matching the method used in the source paper. Datasets with fewer than 100 values had train-test splits created with Leave-One-Out cross-validation. These benchmark data can serve as a basis for a more diverse benchmark dataset in the future, further improving their usefulness for comparing machine learning models.
Keywords: Benchmark data; Machine learning; Materials datasets; Materials discovery; Materials informatics
Year: 2021 PMID: 34345637 PMCID: PMC8319566 DOI: 10.1016/j.dib.2021.107262
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Table 1. The 16 primary sources used to create this collection of benchmark data, listed in alphabetical order.
| Balachandran, Prasanna V., et al. “Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning.” Nature Communications 9.1 (2018): 1–9. |
| Carrete, Jesús, et al. “Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling.” Physical Review X 4.1 (2014): 011019. |
| Lee, Joohwi, et al. “Prediction model of band gap for inorganic compounds by combination of density functional theory calculations and machine learning techniques.” Physical Review B 93.11 (2016): 115104. |
| Li, Wei, Ryan Jacobs, and Dane Morgan. “Predicting the thermodynamic stability of perovskite oxides using machine learning models.” Computational Materials Science 150 (2018): 454–463. |
| Liu, Yue, et al. “The onset temperature (Tg) of AsxSe1-x glasses transition prediction: A comparison of topological and regression analysis methods.” Computational Materials Science 140 (2017): 315–321. |
| Mannodi-Kanakkithodi, Arun, et al. “Machine learning strategy for accelerated design of polymer dielectrics.” Scientific Reports 6 (2016): 20952. |
| Pilania, Ghanshyam, et al. “Accelerating materials property predictions using machine learning.” Scientific Reports 3.1 (2013): 1–6. |
| Pilania, Ghanshyam, et al. “Machine learning bandgaps of double perovskites.” Scientific Reports 6 (2016): 19375. |
| Pilania, Ghanshyam, and X-Y. Liu. “Machine learning properties of binary wurtzite superlattices.” Journal of Materials Science 53.9 (2018): 6652–6664. |
| Rajan, Arunkumar Chitteth, et al. “Machine-learning-assisted accurate band gap predictions of functionalized MXene.” Chemistry of Materials 30.12 (2018): 4031–4038. |
| Seko, Atsuto, et al. “Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids.” Physical Review B 89.5 (2014): 054303. |
| Wei, Han, et al. “Predicting the effective thermal conductivities of composite materials and porous media by machine learning methods.” International Journal of Heat and Mass Transfer 127 (2018): 908–916. |
| Wu, K., et al. “Prediction of polymer properties using infinite chain descriptors (ICD) and machine learning: Toward optimized dielectric polymeric materials.” Journal of Polymer Science Part B: Polymer Physics 54.20 (2016): 2082–2091. |
| Xue, Dezhen, et al. “Accelerated search for materials with targeted properties by adaptive design.” Nature Communications 7.1 (2016): 1–9. |
| Zeng, Shuming, et al. “Machine learning-aided design of materials with target elastic properties.” The Journal of Physical Chemistry C 123.8 (2019): 5042–5047. |
| Zhuo, Ya, Aria Mansouri Tehrani, and Jakoah Brgoch. “Predicting the band gaps of inorganic solids by machine learning.” The Journal of Physical Chemistry Letters 9.7 (2018): 1668–1673. |
Fig. 1. A categorical distribution of the 50 datasets compiled from 16 previous machine learning materials informatics papers. The categorization methods are listed on the left, and the specific descriptors are listed above each colored bar of the graph. The number in each bar is the number of datasets that fit that descriptor (e.g., 44 of the 50 datasets are regression tasks).
Fig. 2. Information about the datasets whose train-val-test splits were created using 5-fold cross-validation. Each dataset is described by its name, material system, organic nature, material property, dataset size, data type, and task type. The source paper for each dataset is given in the left-most column.
Fig. 3. Information about the datasets whose train-val-test splits were created using 10-fold cross-validation. Each dataset is described by its name, material system, organic nature, material property, dataset size, data type, and task type. The source paper for each dataset is given in the left-most column.
Fig. 4. Information about the datasets whose train-test splits were created using Leave-One-Out cross-validation. Each dataset is described by its name, material system, organic nature, material property, dataset size, data type, and task type. The source paper for each dataset is given in the left-most column.
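The splitting rule described in the abstract and in the captions above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' script: the function names, the `val_fraction` parameter, the fixed seed, and the way the validation set is carved out of the non-test folds are all assumptions, since the text does not say how validation rows were chosen. The text also leaves datasets of exactly 100 values unspecified; here they are assumed to fall under Leave-One-Out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def make_splits(n_samples, k=5, val_fraction=0.2, seed=0):
    """Return a list of splits, each a dict mapping split name to row indices.

    Assumption: datasets with more than 100 rows get k-fold
    train/val/test splits; all others get leave-one-out train/test splits.
    """
    if n_samples > 100:
        folds = k_fold_indices(n_samples, k, seed)
        splits = []
        for i, test in enumerate(folds):
            # Every row outside the held-out fold is available for fitting.
            rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            # Assumption: validation is the first val_fraction of the remainder.
            n_val = max(1, int(len(rest) * val_fraction))
            splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
        return splits
    # Leave-one-out: each row serves as the test set exactly once.
    return [
        {"train": [j for j in range(n_samples) if j != i], "test": [i]}
        for i in range(n_samples)
    ]


large = make_splits(250, k=5)   # five train/val/test splits
small = make_splits(12)         # twelve train/test splits
```

Passing `k=10` gives the 10-fold variant. In either k-fold case every row appears in exactly one test fold, which is what allows results to be compared across the whole dataset.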
Datasets that contain no features, only information regarding the material property. These datasets contain only two or three columns, as described in the text.
| Carrete_therm_conduct_train_clean | Pilania_superlattices_HSE_Band_Gap_clean |
|---|---|
| Mannodi_polymer_dielec/Electronic Dielectric Constant | Pilania_superlattices/Interfacial Energy (eV-angstrom^2) |
| Mannodi_polymer_dielec/HSE Band Gap (eV) | Pilania_superlattices/Lattice Parameter (angstrom) |
| Mannodi_polymer_dielec/Ionic Dielectric Constant | Seko_melt_temps |
| Mannodi_polymer_dielec/Total Dielectric Constant | Wu_DFT_Eg_dielec_consts/epsilon_e |
| Pilania_Polymers_data_Spring_Const_clean | Wu_DFT_Eg_dielec_consts/epsilon_i |
| Pilania_Polymers_data_total_Dielec_Const_clean | Wu_DFT_Eg_dielec_consts/GAP |
| Pilania_Polymers_data/Atomization Eng. (eV) | Wu_Exp_dielec_const |
| Pilania_Polymers_data/Bandgap (eV) | Wu_Exp_Tg |
| Pilania_Polymers_data/c [lattice param] (angstrom) | Wu_loss_tang_100Hz |
| Pilania_Polymers_data/elec. Dielec. Const | Wu_loss_tang_1kHz |
| Pilania_Polymers_data/Electron Affinity (eV) | Zeng_elastic_prop/G_Reuss |
| Pilania_Polymers_data/Formation Eng. (eV) | Zeng_elastic_prop/G_Voigt |
| Pilania_superlattices_elastic_consts/c11 (GPa) | Zeng_elastic_prop/G_VRH |
| Pilania_superlattices_elastic_consts/c12 (GPa) | Zeng_elastic_prop/K_Ress |
| Pilania_superlattices_elastic_consts/c13 (GPa) | Zeng_elastic_prop/K_voigt |
| Pilania_superlattices_elastic_consts/c33 (GPa) | Zeng_elastic_prop/K_VRH |
| Pilania_superlattices_elastic_consts/c44 (GPa) | Zhuo_classification_data |
| Pilania_superlattices_Formation_E_clean | Zhuo_regression_data |
| Pilania_superlattices_GGA_Band_Gap_clean | |
Datasets that contain extra features in addition to the material property and material composition. The features vary from dataset to dataset because the material systems and material properties studied differ.
| Bala_classification_dataset | Pilania_double_perovskites_clean |
|---|---|
| Bala_regression_dataset | Rajan_MXene_data |
| Lee_band_gaps | Wei_composite_materials |
| Li_DFT_and_features_clean | Wei_porous_media |
| Li_DFT_dataset_clean | Xue_thermal_hysteresis |
| Liu_Tg_AsSe_glass | |
Fig. 5. The first ten lines of the Zhuo_classification_data dataset. The left column describes compositions of inorganic solids while the right column gives the corresponding band gap values. This is an example of a dataset without features.
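A featureless dataset of this shape (one composition column and one property column) could be read with Python's built-in `csv` module, as in the sketch below. The column names `formula` and `target`, the helper name `load_two_column`, and the sample rows are hypothetical placeholders chosen for illustration; they are not taken from the repository files, and the numeric values are not real band gaps.

```python
import csv
import io

# Placeholder rows for illustration only; NOT values from the repository.
SAMPLE_CSV = """formula,target
Al2O3,1.0
NaCl,2.0
Cu,0.0
"""


def load_two_column(text, composition_col="formula", target_col="target"):
    """Parse a featureless dataset into parallel lists of compositions and targets."""
    reader = csv.DictReader(io.StringIO(text))
    compositions, targets = [], []
    for row in reader:
        compositions.append(row[composition_col])
        targets.append(float(row[target_col]))
    return compositions, targets


compositions, targets = load_two_column(SAMPLE_CSV)
```

Because such a file carries no descriptors, any features (for example, composition-based ones) would have to be generated from the composition strings before a model is trained.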
| Subject | Computational Materials Science |
| Specific subject area | Machine learning models for materials informatics |
| Type of data | Table |
| How data were acquired | Gathered data from past literature |
| Data format | Analyzed, Filtered |
| Parameters for data collection | Data were gathered from papers that had easily accessible data, had been published relatively recently (since 2013), and had applied a regression or classification machine learning model to a material property. Papers meeting these criteria were included regardless of material system, data type (experimental vs calculated), dataset size, or whether the materials were organic or inorganic. |
| Description of data collection | The data were taken from past materials science machine learning papers whose data were either publicly available or provided by the corresponding author on request. Each dataset was downloaded, analyzed using Python's seaborn.distplot, and cleaned where needed. The specific papers used for this dataset are listed in the Data source location row below. |
| Data source location | Primary data sources: See Table 1 |
| Data accessibility | Repository name: GitHub |