Murat Cihan Sorkun, Abhishek Khetan, Süleyman Er.
Abstract
Water is a ubiquitous solvent in chemistry and life. It is therefore no surprise that the aqueous solubility of compounds plays a key role in various domains, including but not limited to drug discovery, paint, coating, and battery materials design. Measuring and predicting aqueous solubility is a complex and prevailing challenge in chemistry. For prediction, different data-driven models have recently been developed to augment the physics-based modeling approaches. To construct accurate data-driven estimation models, it is essential that the underlying experimental calibration data used by these models is of high fidelity and quality. Existing solubility datasets vary in the chemical space of compounds covered, measurement methods, and experimental conditions, as well as in their non-standard representations, size, and accessibility. To address this problem, we generated a new database of compounds, AqSolDB, by merging a total of nine different aqueous solubility datasets, curating the merged data, standardizing and validating the compound representation formats, marking with reliability labels, and providing 2D descriptors of compounds as a Supplementary Resource.
Year: 2019 PMID: 31395888 PMCID: PMC6687799 DOI: 10.1038/s41597-019-0151-1
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Process diagram of curating solubility dataset.
List of datasets used to curate AqSolDB.
| Dataset ID | Original Size | Filtered Size | Compound Representation | Solubility Units |
|---|---|---|---|---|
| A | 14,180 | 6,110 | name, CAS | g/L, mg/L, |
| B | 5,764 | 4,651 | name, CAS | LogS |
| C | 2,603 | 2,603 | name, SMILES | LogS |
| D | 2,267 | 2,115 | name, CAS | LogS |
| E | 1,291 | 1,291 | name, SMILES, CAS | LogS |
| F | 1,210 | 1,210 | SLN | LogS |
| G | 1,144 | 1,144 | name, SMILES | LogS |
| H | 578 | 578 | SLN | LogS |
| I | 105 | 94 | name, SMILES, InChI | |
Dataset ID: identifier assigned to the dataset during the curation process. Original Size: number of instances in the dataset at the time of collection. Filtered Size: number of instances after preprocessing. Compound Representation: compound representations available in the dataset at the time of collection. Solubility Units: units of the experimental solubility values in the dataset.
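Because one source dataset reports mass concentrations (g/L, mg/L) while the others report LogS, merging them requires a common unit. LogS is the base-10 logarithm of the solubility in mol/L, so a mass concentration can be converted once the molecular weight is known. The following is a minimal sketch of that arithmetic, not the authors' conversion code; the function name `to_log_s` and its parameters are hypothetical.

```python
import math

# Conversion factors from the supported units to grams per litre.
# Only the units named in the dataset table are handled here.
_TO_GRAMS_PER_LITRE = {"g/L": 1.0, "mg/L": 1e-3}


def to_log_s(value: float, unit: str, mol_wt: float) -> float:
    """Convert a mass-concentration solubility to LogS (log10 of mol/L).

    `to_log_s` is an illustrative helper, not part of AqSolDB.
    `mol_wt` is the molecular weight in g/mol.
    """
    if unit not in _TO_GRAMS_PER_LITRE:
        raise ValueError(f"unsupported unit: {unit!r}")
    if value <= 0 or mol_wt <= 0:
        raise ValueError("value and mol_wt must be positive")
    # g/L divided by g/mol gives mol/L.
    molar = value * _TO_GRAMS_PER_LITRE[unit] / mol_wt
    return math.log10(molar)
```

For instance, a compound with a molecular weight of 180 g/mol dissolving at 180 g/L has a molar solubility of 1 mol/L, which corresponds to a LogS of 0.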
Fig. 2 Validation steps of compound representations. Blue box represents the SMILES values from the dataset and gray boxes represent the generated values using RDKit. Red arrows represent the conversion steps and green equal sign represents the validation of consistency.
Fig. 3 Redundancy matrices showing fractional values for shared compounds between all collected datasets. (a) shows fraction of compounds with differing solubility values, and (b) shows fraction of compounds with the same solubility values.
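One entry pair of such redundancy matrices could be computed along the following lines. This is a sketch under assumptions the caption does not settle: each dataset is represented as a mapping from a compound key (for example an InChIKey) to a LogS value, the fractions are taken relative to the size of the first dataset, and "same" means equal within a tolerance `tol`. The function name `redundancy_fractions` is hypothetical.

```python
from typing import Dict, Tuple


def redundancy_fractions(
    a: Dict[str, float], b: Dict[str, float], tol: float = 0.0
) -> Tuple[float, float]:
    """Return (fraction_differing, fraction_same) for compounds of `a` found in `b`.

    Both fractions use len(a) as the denominator, which is an assumption
    made for illustration.
    """
    if not a:
        return 0.0, 0.0
    shared = a.keys() & b.keys()
    # Count shared compounds whose values agree within the tolerance.
    same = sum(1 for key in shared if abs(a[key] - b[key]) <= tol)
    differing = len(shared) - same
    return differing / len(a), same / len(a)
```

Calling the function for every ordered pair of datasets would fill the two matrices, one from the first element of each result and one from the second.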
Fig. 4 Flowchart of the curation algorithm. Green box represents the initial state. Blue diamond shapes represent a decision according to the number of occurrences of a compound and the SD of multiple occurrences. Pink boxes represent the reliability group. Gray boxes represent the selection method for multiple occurrences. The numbers over the arrows represent the number of unique compounds in the corresponding classification path.
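The decision structure described in this caption, branching first on the number of occurrences and then on their SD, can be sketched as below. The caption does not give the SD threshold or say which group label belongs to which branch, so the default `sd_threshold=0.5`, the label assignment, and the use of the sample standard deviation are all assumptions made for illustration. The selection of a single value from multiple occurrences (the gray boxes) is not modeled here.

```python
import statistics
from typing import List, Tuple


def classify(values: List[float], sd_threshold: float = 0.5) -> Tuple[str, float]:
    """Assign a reliability group from the LogS values recorded for one compound.

    Hypothetical mapping (an assumption, not taken from the caption):
      G1: single occurrence
      G2: two occurrences, SD above threshold
      G3: two occurrences, SD at or below threshold
      G4: more than two occurrences, SD above threshold
      G5: more than two occurrences, SD at or below threshold
    """
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    if n == 1:
        return "G1", 0.0
    sd = statistics.stdev(values)  # sample SD; the source does not specify which SD
    if n == 2:
        return ("G2" if sd > sd_threshold else "G3"), sd
    return ("G4" if sd > sd_threshold else "G5"), sd
```

Returning the SD alongside the label mirrors the SD, Occurrences, and Group columns that the database stores for each compound.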
Name, description, and type of each column in AqSolDB.
| Column Name | Description | Type |
|---|---|---|
| ID | ID from source (also shows the source) | string |
| Name | Name of compound | string |
| InChI | The IUPAC International Chemical Identifier | string |
| InChIKey | Hashed form of InChI value | string |
| SMILES | SMILES representation of compound | string |
| Solubility | Experimental aqueous solubility value (LogS) | float |
| SD | Standard deviation of multiple occurrences | float |
| Occurrences | Number of occurrences of compound | integer |
| Group | Generated reliability group (G1, G2, G3, G4, G5) | string |
| Mol Wt | Molecular weight | float |
| Mol LogP | Octanol-water partition coefficient | float |
| Mol MR | Molar refractivity | float |
| Heavy Atom Count | Number of non-H atoms | integer |
| Num H Acceptors | Number of H acceptors | integer |
| Num H Donors | Number of H donors | integer |
| Num Heteroatoms | Number of atoms not carbon or hydrogen | integer |
| Num Rotatable Bonds | Number of rotatable bonds | integer |
| Num Valence Electrons | Number of valence electrons | integer |
| Num Aromatic Rings | Number of aromatic rings | integer |
| Num Saturated Rings | Number of saturated rings | integer |
| Num Aliphatic Rings | Number of aliphatic rings | integer |
| Ring Count | Number of total rings | integer |
| TPSA | Topological polar surface area | float |
| Labute ASA | Labute’s Approximate Surface Area | float |
| Balaban J | Balaban’s J index (graph index) | float |
| Bertz CT | A topological complexity index of compound | float |
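The column types listed above suggest how a row might be parsed when the database is read as delimited text. The sketch below assumes a CSV export whose header names match the table exactly; the real file's headers may differ. The reader name `read_rows` is hypothetical, and the sample row contains placeholder values only, not actual AqSolDB data.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Union

# Column names copied from the table; whether the distributed file uses
# exactly these headers is an assumption.
FLOAT_COLUMNS = {
    "Solubility", "SD", "Mol Wt", "Mol LogP", "Mol MR",
    "TPSA", "Labute ASA", "Balaban J", "Bertz CT",
}
INT_COLUMNS = {
    "Occurrences", "Heavy Atom Count", "Num H Acceptors", "Num H Donors",
    "Num Heteroatoms", "Num Rotatable Bonds", "Num Valence Electrons",
    "Num Aromatic Rings", "Num Saturated Rings", "Num Aliphatic Rings",
    "Ring Count",
}

Value = Union[str, int, float]


def read_rows(handle: Iterable[str]) -> Iterator[Dict[str, Value]]:
    """Yield rows with numeric columns converted to float or int."""
    for row in csv.DictReader(handle):
        typed: Dict[str, Value] = {}
        for key, raw in row.items():
            if key in FLOAT_COLUMNS:
                typed[key] = float(raw)
            elif key in INT_COLUMNS:
                typed[key] = int(raw)
            else:
                typed[key] = raw  # string columns are kept as-is
        yield typed


# Placeholder data for illustration only.
sample = (
    "ID,Name,Solubility,SD,Occurrences,Group\n"
    "X-1,placeholder compound,-1.5,0.0,1,G1\n"
)
rows = list(read_rows(io.StringIO(sample)))
```

With typed rows, filtering by reliability group or solubility range becomes an ordinary comparison on the `Group` or `Solubility` field.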
Fig. 5 Bar charts for analyzing the curated dataset. (a) Distribution of instances according to source dataset. (b) Distribution of instances according to reliability group. (c) Distribution of instances according to aqueous solubility ranges (LogS).
| Design Type(s) | chemical reaction data analysis objective • data integration objective • data validation objective |
| Measurement Type(s) | aqueous solubility |
| Technology Type(s) | digital curation |
| Factor Type(s) | physical state |
| Sample Characteristic(s) | |