Edward J Beard, Ganesh Sivaraman, Álvaro Vázquez-Mayagoitia, Venkatram Vishwanath, Jacqueline M Cole.
Abstract
The ability to auto-generate databases of optical properties holds great promise for data-driven materials discovery in optoelectronic applications. We present a cognate set of experimental and computational data that describes key features of optical absorption spectra. This includes an auto-generated database of 18,309 records of experimentally determined UV/vis absorption maxima, λmax, and associated extinction coefficients, ϵ, where present. This database was produced by running the text-mining toolkit ChemDataExtractor on 402,034 scientific documents. High-throughput electronic-structure calculations using fast (simplified Tamm-Dancoff approach) and traditional (time-dependent) density functional theory were executed to predict λmax and oscillator strengths, f (related to ϵ), for a subset of validated compounds. Paired quantities of these computational and experimental data show strong correlations in λmax, f and ϵ, laying the path for reliable in silico calculations of additional optical properties. The total dataset of 8,488 unique compounds, and a subset of 5,380 compounds with both experimental and computational data, are available in MongoDB, CSV and JSON formats. These can be queried using Python, R, Java, and MATLAB for data-driven optoelectronic materials discovery.
Year: 2019 PMID: 31804487 PMCID: PMC6895184 DOI: 10.1038/s41597-019-0306-0
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1. A simple UV/vis absorption spectrum displaying the peak absorption wavelength, λ. The intensity of the peak is given by the molar extinction coefficient, ϵ, whose computational analog is the oscillator strength, f.
Fig. 2. The workflow associated with the different stages of data processing. Stage I (top row): ChemDataExtractor extracts chemical information from academic journal articles. Stage II (middle row): Unique chemical entries from the MongoDB server are passed through a fast-screening layer. Stage III (bottom row): The best candidates are identified and TD-DFT calculations are performed for those selected cases. All of the stages use a secure MongoDB server for database management. Numbers enclosed in rectangular boxes indicate the number of data samples entering or leaving a stage.
Description of data records.
| Key | Description | Data type |
|---|---|---|
| inchikey | International chemical identifier key | String |
| doi | Source document DOI | String |
| lambda | Experimental value of wavelength | Float |
| lambda_unit | Reported unit of wavelength | String |
| extinction | Extinction coefficient | Float |
| extinction_unit | Reported unit of extinction coefficient | String |
| solvent | Solvent reported in the source document | String |
| amplitude | Computed value of wavelength | Float |
| oscillator_strength | Computed value of oscillator strength | Float |
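
As a minimal sketch of how records with the keys listed above might be filtered in Python, the snippet below parses a small JSON array and selects entries by wavelength range. The sample records, the flat-array layout, and the helper `select_records` are assumptions made for illustration; they are not entries from, or the documented layout of, the published database.

```python
import json

# Hypothetical records that use the keys from the table above. The values are
# illustrative placeholders, not entries from the published database.
SAMPLE_JSON = """
[
  {"inchikey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N", "doi": "10.0000/example.1",
   "lambda": 350.0, "lambda_unit": "nm", "extinction": 12000.0,
   "extinction_unit": "M-1 cm-1", "solvent": "ethanol"},
  {"inchikey": "CCCCCCCCCCCCCC-DDDDDDDDDD-N", "doi": "10.0000/example.2",
   "lambda": 520.0, "lambda_unit": "nm", "extinction": null,
   "extinction_unit": null, "solvent": "toluene"},
  {"inchikey": "EEEEEEEEEEEEEE-FFFFFFFFFF-N", "doi": "10.0000/example.3",
   "lambda": 610.0, "lambda_unit": "nm", "extinction": 45000.0,
   "extinction_unit": "M-1 cm-1", "solvent": "ethanol"}
]
"""


def select_records(records, lambda_min, lambda_max, require_extinction=False):
    """Return records whose 'lambda' lies in [lambda_min, lambda_max] nm."""
    selected = []
    for rec in records:
        # Skip records in other units rather than guess a conversion.
        if rec.get("lambda_unit") != "nm":
            continue
        wavelength = rec.get("lambda")
        if wavelength is None or not (lambda_min <= wavelength <= lambda_max):
            continue
        # Extinction coefficients are only present for some records.
        if require_extinction and rec.get("extinction") is None:
            continue
        selected.append(rec)
    return selected


records = json.loads(SAMPLE_JSON)
visible = select_records(records, 400.0, 700.0)
visible_with_eps = select_records(records, 400.0, 700.0, require_extinction=True)
print([r["inchikey"] for r in visible])
print([r["inchikey"] for r in visible_with_eps])
```

Because the extinction coefficient is reported only "where present", the sketch treats a missing `extinction` as a normal case and filters on it explicitly instead of assuming every record carries one.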
Fig. 3. Data validation. (a) Histogram of experimental λ values for all valid compounds in the dataset[6] (blue) overlaid with the AM 1.5 Global Tilt Spectra (red). (b) Histogram for different fractions drawn from the experimental λ values for all valid compounds in the dataset[6]. (c) Histogram for experimental extinction coefficients, ϵ, for all valid compounds in the dataset[6] (inset: experimental extinction coefficient percentiles with the outliers outlined in red). (d) Histograms for a subset of compounds of their experimental λ values (blue) and computed first excitation wavelengths (red). (e) Comparison of sTDA computed properties with the corresponding values computed by TD-DFT. (f) Bar chart of the 10 most common solvents used in the experimental measurements of λ and ϵ values, where there are at least 100 occurrences in the database[6]; solvents are ordered according to the increasing values of dielectric constants. NB: Units in plots correspond to those found most frequently during our data extraction.
Fig. 4. Calculated first excited-state wavelengths versus experimental λ values extracted from the literature for the 76 compounds where both sTDA and TD-DFT calculations were undertaken and only one experimental λ value was obtained. Calculations were performed using sTDA (left panel) or TD-DFT (center panel). Solid lines show the linear regression fit to each dataset and the shaded color regions show the corresponding 98% confidence interval. Mean absolute errors (MAEs) between calculated and experimental values are 64.50 nm and 51.57 nm for sTDA and TD-DFT calculations, respectively. (Right panel) Violin plot showing the wavelength distributions of experimental, sTDA and TD-DFT calculated data. White dots indicate the median; boxes show interquartile ranges; upper and lower whiskers show extremes. Any data beyond the whiskers are outliers.
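
The MAE quoted in the Fig. 4 caption and the median and quartile summary of its violin plot are simple statistics over paired wavelength lists. The following is a minimal sketch of both, using only the Python standard library. The input values are invented for illustration, and the inclusive quartile convention is an assumption, since the source does not state which convention was used.

```python
import statistics


def mean_absolute_error(calculated, experimental):
    """Mean of |calculated - experimental| over paired values (same units)."""
    if len(calculated) != len(experimental):
        raise ValueError("inputs must be paired lists of equal length")
    return statistics.fmean(abs(c - e) for c, e in zip(calculated, experimental))


def summarize(values):
    """Return lower quartile, median and upper quartile of a list of values."""
    # method="inclusive" treats the data as the full population of interest;
    # this choice is an assumption, not something stated in the source.
    lower, median, upper = statistics.quantiles(values, n=4, method="inclusive")
    return {"lower_quartile": lower, "median": median, "upper_quartile": upper}


# Illustrative wavelengths in nm; not data from the paper.
experimental_nm = [300.0, 350.0, 400.0, 450.0, 500.0]
calculated_nm = [280.0, 340.0, 370.0, 400.0, 460.0]

mae = mean_absolute_error(calculated_nm, experimental_nm)
summary = summarize(experimental_nm)
print(f"MAE = {mae:.2f} nm")
print(summary)
```

Applying `summarize` separately to the experimental, sTDA and TD-DFT wavelength lists would give one column each of a median and quartile summary like the one reported for Fig. 4.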
Distribution statistics of data associated with Fig. 4.
| Metric | Exp (nm) | sTDA (nm) | TD-DFT (nm) |
|---|---|---|---|
| Median | 378 | 327 | 344 |
| Upper quartile | 445 | 371 | 403 |
| Lower quartile | 331 | 303 | 315 |

| Field | Value |
|---|---|
| Measurement(s) | ultraviolet–visible spectrum • absorption wavelength • extinction coefficient |
| Technology Type(s) | digital curation |