Kevin Cruse1,2, Amalie Trewartha2, Sanghoon Lee1,3, Zheren Wang1,2, Haoyan Huo1,2, Tanjin He1,2, Olga Kononova1,2, Anubhav Jain3, Gerbrand Ceder4,5.
Abstract
Gold nanoparticles are highly desired for a range of technological applications due to their tunable properties, which are dictated by the size and shape of the constituent particles. Many heuristic methods for controlling the morphological characteristics of gold nanoparticles are well known. However, the underlying mechanisms controlling their size and shape remain poorly understood, partly due to the immense range of possible combinations of synthesis parameters. Data-driven methods can offer insight to help guide understanding of these underlying mechanisms, so long as sufficient synthesis data are available. To facilitate data mining in this direction, we have constructed and made publicly available a dataset of codified gold nanoparticle synthesis protocols and outcomes extracted directly from the nanoparticle materials science literature using natural language processing and text-mining techniques. This dataset contains 5,154 data records, each representing a single gold nanoparticle synthesis article, filtered from a database of 4,973,165 publications. Each record contains codified synthesis protocols and extracted morphological information from a total of 7,608 experimental and 12,519 characterization paragraphs.
Year: 2022 PMID: 35618761 PMCID: PMC9135747 DOI: 10.1038/s41597-022-01321-6
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 AuNP synthesis publication extraction pipeline. Starting with the materials science literature database of more than 4.5 million articles, parsed paragraphs from the articles are funneled through progressively finer-meshed filters to identify those related to the synthesis of gold nanoparticles. The first two steps are a regex search for “nano” phrases followed by the vectorization of that corpus using TF-IDF, similar to the method used in Hiszpanski et al. [20]. Articles whose TF-IDF scores for gold are higher than those for other noble metal nanoparticle compositions are accepted. Each paragraph from those articles is then passed through a binary classifier, which determines whether the paragraph describes the synthesis of gold nanoparticles. Finally, after extracting the synthesis recipes from those relevant paragraphs, 5,154 articles with synthesis paragraphs containing gold- or gold nanoparticle-related targets are collected. An example synthesis paragraph along with a sample of the extracted information is shown in the bottom panel.
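The first two filtering steps in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the term list `METAL_TERMS`, the tokenizer, the smoothed inverse document frequency, and the function names are all assumptions introduced here.

```python
import math
import re
from collections import Counter

NANO_RE = re.compile(r"\bnano\w*", re.IGNORECASE)
# Hypothetical term list; the vocabulary used in the paper is not given here.
METAL_TERMS = ["gold", "silver", "platinum", "palladium"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per document, using a smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        scores.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores


def select_gold_articles(articles):
    """Keep articles that mention 'nano' and score higher on gold than on other metals."""
    nano_articles = [a for a in articles if NANO_RE.search(a)]
    kept = []
    for text, scores in zip(nano_articles, tfidf_scores(nano_articles)):
        gold = scores.get("gold", 0.0)
        others = max(scores.get(m, 0.0) for m in METAL_TERMS if m != "gold")
        if gold > others:
            kept.append(text)
    return kept
```

In the pipeline described above, the paragraphs of the surviving articles would then go to the binary synthesis-paragraph classifier, which is not sketched here.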
Format for lower paragraph-level of each data record: description, key label, data type.
| Data description | Data Key Label | Data Type |
|---|---|---|
| Unique paragraph hash | _id | |
| Whether paragraph contains characterization information | contains_characterization | |
| Whether paragraph contains synthesis recipe | contains_recipe | |
| Materials and quantities contained in paragraph¹ | materials_and_quantities | |
| | | –material: |
| | | –amount: |
| | | –value: |
| | | –unit: |
| Condensed morphological entities in paragraph² | morphological_information | Object ( |
| | | –descriptors: |
| | | –measurements: |
| | | –morphologies: |
| | | –sizes: |
| | | –units: |
| Morphological entities and token locations in paragraph² | morphology_ner_tokens | |
| | | –annotation: |
| | | –start: |
| | | –end: |
| | | –text: |
| Whether or not the AuNP synthesis is seed-mediated¹ | seed_mediated | |
| List of constituent sentences and extracted data¹ | sentences | |
| Synthesis actions and conditions¹ | synth_actions | |
| | | –conditions: Object ( |
| | | –temperature: Object( |
| | | –time: Object( |
| | | –string: |
| | | –subject: |
| | | –type: |
| Snippet of paragraph text | text | |

¹ Only if contains_recipe is true.
² Only if contains_characterization or contains_recipe is true.
³ Contents of paragraphs shown in Table 3.
⁴ {value: list, unit: string, max_value: float, min_value: float}.
Format for highest article-level of each data record: description, key label, data type.
| Data description | Data Key Label | Data Type |
|---|---|---|
| DOI of the original paper | doi | |
| List of constituent paragraphs and extracted data¹ | paragraphs | |
| Year of publication | publication_year | |
| Number of citations | times_referenced | |

¹ Contents of paragraphs shown in Table 2.
Format for lowest sentence-level of each data record: description, key label, data type.
| Data description | Data Key Label | Data Type |
|---|---|---|
| All material entities | all_materials | |
| Non-precursors and non-target material entities | other_materials | |
| Precursor material entities | precursors | |
| Sequence of synthesis operations and conditions | procedure_graph | |
| | | –env_toks: |
| | | –op_token: |
| | | –op_type: |
| | | –ref_op: |
| | | –subject: |
| | | –temp_values: |
| | | –time_values: |
| Target material entities | target | |

¹ {material: string, amount: [{value: float, unit: string}]}.
² {max: float, min: float, tok_ids: list, units: string, values: list}.
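The three nesting levels described in the tables above (article, paragraph, sentence) can be walked as in the following sketch. The key labels come from the tables; the example record, its values, and the value types (for instance, `morphologies` as a list of strings and `precursors` as a list of `{material, amount}` objects) are assumptions made for illustration, and `summarize` is a hypothetical helper.

```python
from collections import Counter

# Hand-written record that mirrors the key labels in the tables above.
# Every value is invented for illustration; value types are assumptions.
example_record = {
    "doi": "10.0000/hypothetical",
    "publication_year": 2015,
    "times_referenced": 12,
    "paragraphs": [
        {
            "_id": "p1",
            "contains_recipe": True,
            "contains_characterization": False,
            "seed_mediated": True,
            "morphological_information": {"morphologies": ["rod"]},
            "sentences": [
                {"precursors": [{"material": "HAuCl4",
                                 "amount": [{"value": 0.5, "unit": "mM"}]}]},
                {"precursors": []},
            ],
            "text": "...",
        },
        {
            "_id": "p2",
            "contains_recipe": False,
            "contains_characterization": True,
            "morphological_information": {"morphologies": ["rod", "sphere"]},
            "text": "...",
        },
    ],
}


def summarize(record):
    """Collect precursors from recipe paragraphs and morphologies from all paragraphs."""
    precursors = set()
    morphologies = Counter()
    seed_mediated = False
    for paragraph in record.get("paragraphs", []):
        info = paragraph.get("morphological_information", {})
        morphologies.update(info.get("morphologies", []))
        if not paragraph.get("contains_recipe"):
            continue  # recipe-only keys are absent for other paragraphs
        seed_mediated = seed_mediated or bool(paragraph.get("seed_mediated"))
        for sentence in paragraph.get("sentences", []):
            for entry in sentence.get("precursors", []):
                precursors.add(entry["material"])
    return {
        "doi": record["doi"],
        "seed_mediated": seed_mediated,
        "precursors": precursors,
        "morphologies": morphologies,
    }
```

The `.get` calls with defaults reflect the footnotes: several keys are present only when `contains_recipe` or `contains_characterization` is true, so a reader should not assume that every paragraph carries every key.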
Text extraction model accuracies.
| Pipeline Component | ML Method | F1 | Precision | Recall |
|---|---|---|---|---|
| Article filtering | regex/TF-IDF | 0.96 | 1.00 | 0.92 |
| Synthesis paragraph classification | BERT classification | 0.90 | 0.96 | 0.85 |
| Characterization paragraph classification | BERT classification | 0.90 | 0.93 | 0.87 |
| Materials Entity Recognition | BiLSTM+CRF (MatBERT embeddings) | | | |
| – materials¹ | | 0.95 | 0.95 | 0.95 |
| – precursors¹ | | 0.90 | 0.89 | 0.91 |
| – targets¹ | | 0.85 | 0.86 | 0.83 |
| Morphology Entity Recognition | Fine-tuned MatBERT NER model | | | |
| – Micro average | | 0.87 | 0.89 | 0.84 |
| – MOR (morphology) | | 0.92 | 0.90 | 0.95 |
| – DES (descriptor) | | 0.56 | 0.70 | 0.52 |
| – MES (measurement) | | 0.70 | 0.83 | 0.64 |
| – SIZ (size value) | | 0.69 | 0.81 | 0.62 |
| – UNT (unit) | | 0.91 | 0.94 | 0.91 |
| Synthesis actions² | BiLSTM (Word2Vec embeddings) | 0.89 | 0.90 | 0.88 |
| Synthesis conditions³ | Rule-based | | | |
| – Temperature | | 0.94 | 0.97 | 0.92 |
| – Time | | 0.93 | 0.98 | 0.89 |
| Material quantities³ | Rule-based | 0.87 | 0.90 | 0.85 |
| Seed-mediated tag | Rule-based | 1.00 | 1.00 | 1.00 |

¹ Metrics from He et al. [33].
² Metrics from associated manuscript on synthesis actions extraction [35].
³ Metrics from accepted publication on solution synthesis extraction [42].
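For readers who want to relate the three numbers in each row, F1 is conventionally the harmonic mean of precision and recall. The sketch below recomputes it for the first three rows of the table; the helper name `f1_score` is introduced here, and the paper's own evaluation code may differ.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall) copied from the table above.
reported = {
    "Article filtering": (0.96, 1.00, 0.92),
    "Synthesis paragraph classification": (0.90, 0.96, 0.85),
    "Characterization paragraph classification": (0.90, 0.93, 0.87),
}

for name, (f1, precision, recall) in reported.items():
    print(f"{name}: reported {f1:.2f}, recomputed {f1_score(precision, recall):.2f}")
```

Because the table reports precision and recall already rounded to two decimals, and some rows may be averaged over classes or folds, recomputed values for other rows can differ slightly from the reported F1.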
Fig. 2 Frequencies of most common AuNP synthesis precursors. The most frequently extracted precursors using materials entity recognition (MER, “Synthesis recipe extraction”) were inspected and compiled into a regular expression-based synonym map, which is housed in the available code repository (see “Code availability”). The precursors are binned by their function in AuNP synthesis, and the number of publications employing seed-mediated growth is distinguished from the number employing seedless methods. Citrates are considered in their own category since their function varies depending on the method used; for instance, citrates are used as a reducing agent and a ligand in Turkevich or Frens reduction, but only as a ligand in other reduction methods. Only the precursors appearing in more than 50 articles are shown. Precursors were counted once per article within the seed-mediated and seedless growth categories for this analysis. This avoids double counting precursors that are mentioned in both a purchasing paragraph and a synthesis paragraph, since both can sometimes be classified as a synthesis paragraph by our binary gold nanoparticle synthesis paragraph classifier.
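The synonym mapping and once-per-article counting described in this caption can be illustrated as follows. The patterns in `SYNONYMS` are hypothetical stand-ins, since the actual map is in the authors' code repository, and the function names are introduced here.

```python
import re
from collections import Counter

# Hypothetical synonym patterns; the real map is in the authors' repository.
SYNONYMS = {
    "HAuCl4": re.compile(r"haucl4|chloroauric acid|tetrachloroauric", re.IGNORECASE),
    "citrate": re.compile(r"citrate", re.IGNORECASE),
    "NaBH4": re.compile(r"nabh4|sodium borohydride", re.IGNORECASE),
}


def canonical_name(mention):
    """Map an extracted precursor mention to a canonical label, or None."""
    for name, pattern in SYNONYMS.items():
        if pattern.search(mention):
            return name
    return None


def count_articles_per_precursor(articles):
    """articles: {doi: [precursor mention, ...]}.

    Each canonical precursor is counted at most once per article, so repeated
    mentions across paragraphs of the same article do not inflate the count.
    """
    counts = Counter()
    for mentions in articles.values():
        seen = {canonical_name(m) for m in mentions}
        seen.discard(None)
        counts.update(seen)
    return counts
```

To reproduce the split shown in the figure, the same function could be called separately on the seed-mediated and seedless subsets of articles.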
Fig. 3 Breakdown of reported AuNP morphologies discussed by year. As in the frequent precursors analysis, we compiled a regular expression-based synonym map for the most frequently extracted morphological entities and descriptors. The timeline represents the cumulative number of publications discussing each of the specified morphologies from 1998 to 2021. The entire plot represents 1,744 articles from the dataset, each of which discusses only one of the morphologies in the set depicted.
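A cumulative per-morphology timeline of the kind this caption describes can be computed as in the sketch below. The input shape (a year paired with the set of canonical morphologies found in an article) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import accumulate


def cumulative_counts(articles, first_year, last_year):
    """articles: iterable of (publication_year, set_of_morphologies).

    Only articles discussing exactly one morphology are counted, matching the
    single-morphology restriction stated in the caption.
    """
    per_morphology = defaultdict(Counter)
    for year, morphologies in articles:
        if len(morphologies) == 1:
            (morphology,) = morphologies
            per_morphology[morphology][year] += 1
    span = range(first_year, last_year + 1)
    return {
        morphology: dict(zip(span, accumulate(by_year[y] for y in span)))
        for morphology, by_year in per_morphology.items()
    }
```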
Fig. 4 Heatmap depicting correlation between precursors and resultant AuNP morphologies. The heat illustrated in a given cell represents the fraction of morphologically-targeted articles (say, the fraction of sphere-related articles) that use that particular precursor among the one or more precursors in their recipes. For instance, the top left cell shows that more than 90% of purely sphere-related AuNP synthesis papers use as a precursor. “Citrate” also includes sodium citrate precursors. The entire heatmap describes 1,511 single morphology-targeted articles with at least one of the precursors or precursor ions shown on the y-axis.
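Each heatmap cell, as defined in this caption, is a simple ratio: the number of articles targeting a given morphology that use a given precursor, divided by the number of articles targeting that morphology. A minimal sketch follows; the input shape, the function name, and the placeholder precursor labels in the usage note are assumptions.

```python
from collections import Counter, defaultdict


def precursor_fractions(articles):
    """articles: iterable of (morphology, set_of_precursors), one per article.

    Returns {morphology: {precursor: fraction of that morphology's articles
    whose recipe includes the precursor}}.
    """
    n_articles = Counter()
    n_using = defaultdict(Counter)
    for morphology, precursors in articles:
        n_articles[morphology] += 1
        n_using[morphology].update(set(precursors))
    return {
        morphology: {
            precursor: n_using[morphology][precursor] / n_articles[morphology]
            for precursor in n_using[morphology]
        }
        for morphology in n_articles
    }
```

For example, with two sphere articles using `{"P1", "citrate"}` and `{"P1"}`, the sphere row would hold 1.0 for the placeholder `P1` and 0.5 for `citrate`. Because an article can use several precursors, the fractions within a morphology need not sum to one.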
| Measurement(s) | gold nanoparticle morphology • gold nanoparticle size • gold nanoparticle synthesis data |
| Technology Type(s) | natural language processing |