Lauri Himanen, Amber Geurts, Adam Stuart Foster, Patrick Rinke.
Abstract
Data-driven science is heralded as a new paradigm in materials science. In this field, data is the new resource, and knowledge is extracted from materials datasets that are too big or complex for traditional human reasoning, typically with the intent to discover new or improved materials or materials phenomena. Multiple factors, including the open science movement, national funding, and progress in information technology, have fueled its development. Related tools such as materials databases, machine learning, and high-throughput methods are now established parts of the materials research toolset. However, a variety of challenges impede progress in data-driven materials science: data veracity, integration of experimental and computational data, data longevity, standardization, and the gap between industrial interests and academic efforts. In this perspective article, the historical development and current state of data-driven materials science, building from the early evolution of open science to the rapid expansion of materials data infrastructures, are discussed. Key successes and challenges so far are also reviewed, providing a perspective on the future development of the field.
Keywords: artificial intelligence; data science; databases; machine learning; materials; materials science; open innovation; open science
Year: 2019 PMID: 31728276 PMCID: PMC6839624 DOI: 10.1002/advs.201900808
Source DB: PubMed Journal: Adv Sci (Weinh) ISSN: 2198-3844 Impact factor: 16.806
Figure 1. Materials discovery schematic. In the traditional approach, new materials are discovered by experimentation, theory, or computation (also referred to as 1st, 2nd, and 3rd paradigms and symbolized by the three icons at the top of the left panel). In the 4th paradigm of data‐driven materials science, available data is gathered in data infrastructures, and machine learning approaches discover new materials.
Table 1. List of current major materials data infrastructures. The entries are divided into non‐commercial (top) and commercial (bottom). Note that some platforms are named after the leading research project and may host multiple services under different names. As contact person, we list the director(s) of each infrastructure where they were clearly identifiable. Data volume numbers reflect the state in April 2019.
| Name | Website | Contact | Overview |
|---|---|---|---|
| AFLOW | aflowlib.org | Stefano Curtarolo, Duke University | Computational data consisting of 2 118 033 material compounds and 281 698 389 calculated properties with focus on inorganic crystal structures. Incorporates multiple computational modules for automating high‐throughput first principles calculations. |
| Computational Materials Repository | cmr.fysik.dtu.dk | Kristian Thygesen and Karsten Jacobsen, DTU | Computational datasets from a diverse set of applications. Data creation and analysis with the Atomic Simulation Environment (ASE). |
| Crystallography Open Database | crystallography.net | | Open‐access collection of crystal structures of organic, inorganic, metal–organic compounds and minerals, excluding biopolymers. |
| HTEM | htem.nrel.gov | Caleb Phillips and Andriy Zakutayev, NREL | Properties of thin films synthesized using combinatorial methods. Contains 57 597 thin film samples, across a wide range of materials (oxides, nitrides, sulfides, intermetallics). |
| Khazana | khazana.gatech.edu | Rampi Ramprasad, Georgia Institute of Technology | Platform to store structure and property data created by atomistic simulations, and tools to design materials by learning from the data. Tools include Polymer Genome and AGNI. |
| MARVEL NCCR | nccr-marvel.ch | Nicola Marzari, EPFL | Materials informatics platform for data‐driven high‐throughput quantum simulations. Data available at materialscloud.org, powered by the AiiDA infrastructure. |
| Materials Data Facility (MDF) | materialsdatafacility.org | Ben Blaiszik and Ian Foster, University of Chicago | Data publication network for computational and experimental datasets. Data exploration through the Forge Python package. |
| Materials Project | materialsproject.org | Kristin Persson, LBNL | Online platform for materials exploration containing data of 86 680 inorganic compounds, 21 954 molecules and 530 243 nanoporous materials. Develops various open‐source software libraries, including pymatgen, custodian, FireWorks, and atomate. |
| MatNavi/NIMS | mits.nims.go.jp | Yibin Xu, NIMS | An integrated material database system comprising ten databases, four application systems and the NIMS Structural Datasheet Online. |
| NOMAD CoE | nomad-coe.eu | Matthias Scheffler, FHI/Max Planck Society | Provides storage for full input and output files of all important computational materials science codes, with multiple big‐data services built on top. Contains over 50 236 539 total energy calculations. |
| Organic Materials Database | omdb.mathub.io | Alexander Balatsky, Nordita | Open access electronic structure database for 3‐dimensional organic crystals. Contains approximately 24 000 materials. |
| Open Quantum Materials Database | oqmd.org | Chris Wolverton, Northwestern University | Database of DFT‐calculated thermodynamic and structural properties with focus on inorganic crystal structures. Contains 563 247 entries with support for full download and advanced usage through the qmpy Python package. |
| Open Materials Database | openmaterialsdb.se | Rickard Armiento, Linköping University | Computational database primarily based on structures from the Crystallography Open Database. Data creation and analysis with High‐Throughput Toolkit (httk). |
| SUNCAT | suncat.stanford.edu | Thomas Francisco Jaramillo, SLAC/Stanford University | Materials informatics center for atomic‐scale design of catalysts. Online tools and computational results for 112 157 surface reactions and barriers available at catalysis-hub.org. |
| Citrine Informatics | citrine.io | Bryce Meredig and Greg Mulholland | A materials informatics platform combining data infrastructure and AI. Open database and analytics platform for material and chemical information available at the Citrination platform: citrination.com. |
| Exabyte.io | exabyte.io | Timur Bazhirov | Cloud‐based modelling platform for materials informatics. |
| Granta Design | grantadesign.com | Mike Ashby and David Cebon | R&D organization offering data, tools and expertise for materials design. |
| Materials Design | materialsdesign.com | Clive M. Freeman, Erich Wimmer and Stephen J. Mumby | Software products and services for chemical, metallurgical, electronic, polymeric, and materials science research applications. |
| Materials Platform for Data Science | mpds.io | Evgeny Blokhin | Online edition of the PAULING FILE with focus on curated experimental data for inorganic materials. |
| MaterialsZone | materials.zone | Assaf Anderson and Barak Sela | Provides a notebook‐based materials informatics environment together with experimental data. |
| SpringerMaterials | materials.springer.com | Michael Klinge | Curated data covering multiple material classes, property types, and applications. A set of advanced functionalities for visualizing and analyzing data provided through SpringerMaterials Interactive. |
Figure 2. Timeline and geographic distribution of materials data infrastructures and companies. The color of the dots represents the time of establishment. The map shows that historically more centers have emerged in the U.S. and Europe, with Asia catching up over time. In addition, the U.S. has a higher renewal rate than Europe, as can be seen in the larger number of lighter colored dots. CSD: Cambridge Structural Database, ICSD: Inorganic Crystal Structure Database, ESP: Electronic Structure Project, AFLOW: Automatic‐Flow for Materials Discovery, AIST: National Institute of Advanced Industrial Science and Technology Databases, COD: Crystallography Open Database, MatDL: Materials Digital Library, CMR: Computational Materials Repository, NREL CID: NREL Center for Inverse Design, CEPDB: The Clean Energy Project Database, MGI: Materials Genome Initiative, CMD: Computational Materials Network, OQMD: Open Quantum Materials Database, NOMAD: Novel Materials Discovery Laboratory, MaX: Materials Design at the Exascale, MICCOM: Midwest Integrated Center for Computational Materials, MPDS: Materials Platform for Data Science, CMI2: Center for Materials Research by Information Integration, HTEM: High Throughput Experimental Materials Database, JARVIS: Joint Automated Repository for Various Integrated Simulations, OMDB: Organic Materials Database, QCArchive: The Quantum Chemistry Archive.
Figure 3. Number of materials informatics projects and infrastructures as a function of time (see Figure 2 and Table 1 for details on individual projects and infrastructures). We divide the time axis into three periods that reflect the evolution of the data infrastructures (see text for details).
Table 2. Services provided by the selected materials data infrastructures. Open access: provides partial or full free access to data. Computational data: contains data originating from software simulations. Experimental data: contains data originating from experiments. Data upload: allows upload of own data, with the possibility of issuing Digital Object Identifiers (DOIs). Workflow management tools: provides or collaborates in the development of open‐source software tools for workflow management. Web API: data can be accessed remotely with automated scripts. Data analysis tools: provides online or offline data analysis tools, including machine learning.
| Open access | Comp. data | Exp. data | Data upload (DOIs) | Workflow management tools | Web API | Data analysis tools | |
|---|---|---|---|---|---|---|---|
| AFLOW | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Computational Materials Repository | ✓ | ✓ | ✓ | ✓ | |||
| Crystallography Open Database | ✓ | ✓ | ✓ | ✓ | |||
| HTEM | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Khazana | ✓ | ✓ | ✓ | ✓ | |||
| MARVEL NCCR | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Materials Data Facility (MDF) | ✓ | ✓ | ✓ | ✓ (DOI) | ✓ | ||
| Materials Project | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| MatNavi/NIMS | ✓ | ✓ | ✓ | ✓ | |||
| NOMAD CoE | ✓ | ✓ | ✓ (DOI) | ✓ | ✓ | ||
| Organic Materials Database | ✓ | ✓ | ✓ | ||||
| Open Quantum Materials Database | ✓ | ✓ | ✓ | ||||
| Open Materials Database | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| SUNCAT | ✓ | ✓ | ✓ | ✓ | |||
| Citrine Informatics | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Exabyte.io | ✓ | ✓ | |||||
| Granta Design | ✓ | ✓ | ✓ | ||||
| Materials Design | ✓ | ✓ | ✓ | ||||
| Materials Platform for Data Science | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| MaterialsZone | ✓ | ✓ | |||||
| SpringerMaterials | ✓ | ✓ |
Upload requires access to private/institutional storage space.
Open access to a subset of data.
Open access to a limited set of materials properties.
Figure 4. Challenges faced by materials data infrastructures (on the left) on the way to increasing adoption by stakeholders from academia, industry, governments and the public (depicted on the right).
Figure 5. Example of an ontological hierarchy for the structural characterization of materials: a materials tree of life. Adapted under the terms of the Creative Commons Attribution 4.0 International License.[145] Copyright 2018, The Authors, published by Springer Nature.
Figure 6. A computational workflow used in creating a dataset of elastic tensors with the FireWorks workflow manager. The indigo boxes correspond to inputs or results, lighter blue boxes correspond to actions, and green diamonds correspond to decisions. Adapted with permission.[188] Copyright 2019, Wiley.
Figure 7. Key steps in building a machine learning model. The white arrows indicate the flow of data; the green arrows indicate actions that can be identified and performed after analysis to improve the performance of the model.
Figure 8. The machine learning domain in terms of data volume and the complexity of the physical process, with selected examples placed in this domain.[223–228] The complexity of a physical process here means the complex, nonlinear structures present in the data. Two opposing learning scenarios, a hard and an easy one, are illustrated in the lower panel. In these two cases, the underlying physical process is represented by a colored contour map, and the sampling of this process is represented by black crosses.
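The contrast between the easy and the hard learning scenario can be illustrated with a small numerical sketch. It is not taken from the article: the two one-dimensional test functions, the sample counts, and the nearest-neighbor model are assumptions chosen only to show how sparse sampling of a complex process leads to a much larger prediction error than dense sampling of a simple one.

```python
import math


def nearest_neighbor_error(process, n_samples, n_test=1000):
    """Mean absolute error of a 1-nearest-neighbor model on the interval [0, 1]."""
    # Training set: evenly spaced samples of the process (the "black crosses").
    xs = [i / (n_samples - 1) for i in range(n_samples)]
    ys = [process(x) for x in xs]
    total = 0.0
    for j in range(n_test):
        x = j / (n_test - 1)
        # Predict with the value of the closest training sample.
        idx = min(range(n_samples), key=lambda i: abs(xs[i] - x))
        total += abs(process(x) - ys[idx])
    return total / n_test


def simple_process(x):
    # Hypothetical smooth process: a single slow oscillation.
    return math.sin(2 * math.pi * x)


def complex_process(x):
    # Hypothetical complex process: twenty fast oscillations.
    return math.sin(40 * math.pi * x)


# Easy scenario: simple process, many samples.
easy_error = nearest_neighbor_error(simple_process, n_samples=200)
# Hard scenario: complex process, few samples.
hard_error = nearest_neighbor_error(complex_process, n_samples=10)

print(f"easy scenario error: {easy_error:.3f}")
print(f"hard scenario error: {hard_error:.3f}")
```

In this toy setting the model is identical in both cases; only the data volume and the complexity of the sampled process differ, which are the two axes of the figure.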
Figure 9. a) Schematic of an ecosystem in data‐driven materials science with materials data platforms at the center. In this ecosystem, different stakeholders from universities, the public, industry, and government facilitate the development of a technology. b–d) Possible relationships between data platforms and industry, discussed in the text.