Pim N H Wassenaar1,2, Emiel Rorije1, Martina G Vijver2, Willie J G M Peijnenburg1,2.
Abstract
Screening and prioritization of chemicals is essential to ensure that available evaluation capacity is invested in those substances that are of highest concern. We therefore recently developed structural similarity models that evaluate the structural similarity of substances with unknown properties to known Substances of Very High Concern (SVHC), which could be an indication of comparable effects. In the current study, the performance of these models is improved by (1) separating known SVHCs into more specific subgroups, (2) (re-)optimizing similarity models for the various SVHC-subgroups, and (3) improving interpretability of the predicted outcomes by providing a confidence score. The improvements are directly incorporated in a freely accessible web-based tool, named the ZZS similarity tool: https://rvszoeksysteem.rivm.nl/ZzsSimilarityTool. Accordingly, this tool can be used by risk assessors, academia and industrial partners to screen and prioritize chemicals for further action and evaluation within varying frameworks, and could support the identification of tomorrow's substances of concern.
Keywords: chemical similarity; classification model; screening and prioritization; substances of very high concern
Year: 2022 PMID: 35403727 PMCID: PMC9322536 DOI: 10.1002/jcc.26859
Source DB: PubMed Journal: J Comput Chem ISSN: 0192-8651 Impact factor: 3.672
FIGURE 1. Illustration of the workflow of each of the separate similarity models that are incorporated in the ZZS similarity tool (note that there are some variations for the specific sub‐models, see section ‘3.2 models’). Steps 1 and 6 cover the input and output as shown by the ZZS similarity tool, and steps 2–5 are used to calculate and predict the structural similarity. The exact specifications of steps 3–5 differ per SVHC‐category. An input structure can be provided as SMILES or CAS‐number (step 1), which is converted to a standardized SMILES to ensure equal comparison to SVHC structures (step 2). The standardized SMILES is used to generate chemical fingerprints using PaDEL‐descriptor (step 3). The fingerprint of the input structure is compared to the fingerprints of all SVHCs of a specific category to calculate similarity values using a similarity coefficient (step 4). The calculated similarity values are compared to a similarity threshold to predict whether the input structure is considered sufficiently structurally similar to an SVHC (step 5), and the results are reported (step 6). For each SVHC‐category a specific model was developed and optimized, consisting of a unique fingerprint, coefficient and threshold combination; the outcomes are reported separately for each SVHC‐category.
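Steps 4 and 5 of this workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: fingerprints are represented as sets of on-bit indices (PaDEL‐descriptor is not called), the abbreviations JT, SM and CT4 are assumed to denote the Jaccard‐Tanimoto, simple matching and Consonni‐Todeschini 4 coefficients with their standard definitions, and the function names, the toy SVHC entries and the inclusive threshold comparison are all hypothetical.

```python
import math

def counts(fp_a: set, fp_b: set, n_bits: int) -> tuple:
    """Return (a, b, c, d): shared on-bits, on-bits only in A,
    on-bits only in B, and shared off-bits."""
    a = len(fp_a & fp_b)
    b = len(fp_a - fp_b)
    c = len(fp_b - fp_a)
    d = n_bits - a - b - c
    return a, b, c, d

def jaccard_tanimoto(a, b, c, d):
    total = a + b + c
    return a / total if total else 0.0

def simple_matching(a, b, c, d):
    total = a + b + c + d
    return (a + d) / total if total else 0.0

def consonni_todeschini_4(a, b, c, d):
    denom = math.log(1 + a + b + c)
    return math.log(1 + a) / denom if denom else 0.0

# Assumed mapping of the abbreviations used in the tables.
COEFFICIENTS = {
    "JT": jaccard_tanimoto,
    "SM": simple_matching,
    "CT4": consonni_todeschini_4,
}

def most_similar(query, svhc_fps, coefficient, threshold, n_bits):
    """Step 4: score the query against every SVHC fingerprint.
    Step 5: compare the best score to the threshold (assumed inclusive)."""
    func = COEFFICIENTS[coefficient]
    best_id, best_score = None, -1.0
    for svhc_id, fp in svhc_fps.items():
        score = func(*counts(query, fp, n_bits))
        if score > best_score:
            best_id, best_score = svhc_id, score
    return best_id, best_score, best_score >= threshold

# Toy data: hypothetical 16-bit fingerprints, not real SVHC structures.
svhcs = {"SVHC-A": {1, 2, 3, 4, 5}, "SVHC-B": {7, 8}}
result = most_similar({1, 2, 3, 4}, svhcs, "JT", 0.774, n_bits=16)
```

In the toy call, the query shares four of five on-bits with `SVHC-A`, so the Jaccard‐Tanimoto value is 0.8, which exceeds the 0.774 threshold listed for the PubChem/JT model in the performance table.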
Aspects of the structural similarity models that are adjusted within the current study.
| Adjusted aspects | Description and motivation |
|---|---|
| Dataset | Update of the underlying SVHC dataset. |
| Model‐separation | Separation of CM and R concerns, as these effects are often exerted via different modes of action. Improved distinction between European SVHCs (including CLP classifications and POP identifications) and Dutch SVHCs. |
| Model (re‐)optimization | Optimization of the sub‐models. Specifically necessary for the PBT/vPvB category, for which a moderate performance on the broader universe of chemicals was observed. |
| Outcome interpretation | Addition of a quantitative confidence score, besides the qualitative conclusion (sufficiently similar: yes/no), to support better outcome interpretation. |
Abbreviations: SVHC—substances of very high concern; CMR—carcinogenic (C), mutagenic (M) or reprotoxic (R) properties; PBT/vPvB—very (v) persistent (P), bioaccumulative (B) and toxic (T) properties; CLP—classification, labelling and packaging of substances and mixtures; POP—persistent organic pollutants.
FIGURE 2. Relation between the structural similarity value and the confidence in the predicted structural similarity between a chemical and a Reprotoxic (R)‐SVHC based on the CDK extended‐CT4 fingerprint‐coefficient combination. The fitted curves describe the normalized bPPV as a function of the similarity value used as a threshold value, and are derived from the R‐SVHC and non‐SVHC datasets (for substances with less than 85 fragment features, that is, bits in the CDK extended fingerprint). The vertical line represents the model's optimized threshold value (0.851) giving the best balanced accuracy, and the horizontal line represents the 50% confidence score. More details are presented in Supplemental Material S3.
Overview of the new dataset and the distribution over hazard categories, in comparison to the previous dataset as included in Reference [2].
| Hazard class | Previous dataset | New dataset |
|---|---|---|
| Total | 546 | 621 |
| CM | 150¹ | 153 |
| R | 166¹ | 178 |
| PBT/vPvB | 209 | 137 |
| ED | 52 | 51 |
| Other | ‐² | 131³ |
Note: 1—In the previous work, CM and R were combined as one class (n = 306). 2—In the previous work, no ‘Other’‐category was included. 3—The ‘Other’‐category includes 10 substances that are identified as EU‐SVHC based on PMT (n = 3) or respiratory sensitizing properties (n = 7). All others are not identified as EU‐SVHC, EU‐CLP or POP, but are included on the Dutch list of SVHCs based on specific concerns related to similar endpoints (C: n = 3, M: n = 1, R: n = 14, PBT: n = 64, PBT/vPvB: n = 29, ED: n = 6, PMT: n = 2, and others: n = 2) from other sources (e.g., OSPAR, in which PBT/vPvB concerns dominate).
Overview of the final models—including performance statistics—to predict structural similarity to SVHCs.
| Subset | Fingerprint | Coefficient | Threshold | #SVHCs | #non‐SVHCs | TP | FP | TN | FN | Sens | Spec | bAcc | bPPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CM < 85 | CDK Extended | CT4 | 0.851 | 89 | 63 | 55 | 6 | 57 | 34 | 0.618 | 0.905 | 0.761 | 0.866 |
| CM ≥ 85 | CDK Extended | SM | 0.944 | 64 | 343 | 36 | 5 | 338 | 28 | 0.563 | 0.985 | 0.774 | 0.975 |
| CM‐combined¹ | ‐ | ‐ | ‐ | 153 | 406 | 91 | 11 | 395 | 62 | 0.595 | 0.973 | 0.784 | 0.956 |
| R < 85 | CDK Extended | CT4 | 0.851 | 57 | 63 | 36 | 7 | 56 | 21 | 0.632 | 0.889 | 0.760 | 0.850 |
| R ≥ 85 | CDK Extended | SM | 0.944 | 121 | 343 | 77 | 7 | 336 | 44 | 0.636 | 0.980 | 0.808 | 0.969 |
| R‐combined¹ | ‐ | ‐ | ‐ | 178 | 406 | 113 | 14 | 392 | 65 | 0.635 | 0.966 | 0.800 | 0.948 |
| PBT‐1 | PubChem | JT | 0.774 | 137 | 406 | 130 | 2 | 404 | 7 | 0.949 | 0.995 | 0.972 | 0.995 |
| PBT‐2 | CDK Extended | CT4 | 0.887 | 137 | 406 | 124 | 1 | 405 | 13 | 0.905 | 0.998 | 0.951 | 0.997 |
| PBT‐combined² | ‐ | ‐ | ‐ | 137 | 406 | 123 | 1 | 405 | 14 | 0.898 | 0.998 | 0.948 | 0.997 |
| ED | CDK Extended | JT | 0.693 | 51 | 406 | 50 | 0 | 406 | 1 | 0.980 | 1.000 | 0.990 | 1.000 |
| Other‐1 | PubChem | JT | 0.818 | 131 | 406 | 87 | 15 | 391 | 44 | 0.664 | 0.963 | 0.814 | 0.947 |
| Other‐2 | CDK Extended | CT4 | 0.901 | 131 | 406 | 76 | 17 | 389 | 55 | 0.580 | 0.958 | 0.769 | 0.933 |
| Other‐combined² | ‐ | ‐ | ‐ | 131 | 406 | 73 | 10 | 396 | 58 | 0.557 | 0.975 | 0.766 | 0.958 |
Note: 1—Substance is either assessed on structural similarity according to model 1 or model 2, depending on its number of fragment features. 2—Substance is assessed based on model 1 and model 2, and is only considered as structurally similar to an SVHC when it meets the criteria of both models.
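The performance statistics and the two combination rules described in the note can be sketched as below. The sensitivity, specificity and balanced accuracy formulas are the standard ones. The bPPV formula (sensitivity divided by the sum of sensitivity and the false positive rate) is an assumption that happens to reproduce the tabulated values; it is not stated in this text. The function names and the string labels for the sub‐models are hypothetical.

```python
def performance(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Sens, Spec, bAcc and bPPV from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    bacc = (sens + spec) / 2
    # Assumed definition of the balanced positive predictive value.
    denom = sens + (1 - spec)
    bppv = sens / denom if denom else 0.0
    values = {"Sens": sens, "Spec": spec, "bAcc": bacc, "bPPV": bppv}
    return {name: round(value, 3) for name, value in values.items()}

def select_submodel(n_fragment_features: int, cutoff: int = 85) -> str:
    """Rule 1 (CM and R): pick one sub-model by the number of fragment features."""
    return "<85" if n_fragment_features < cutoff else ">=85"

def similar_both(score_1: float, threshold_1: float,
                 score_2: float, threshold_2: float) -> bool:
    """Rule 2 (PBT and Other): similar only if both models exceed their
    thresholds. The inclusive comparison is an assumption."""
    return score_1 >= threshold_1 and score_2 >= threshold_2

# Counts taken from the 'CM < 85' and 'PBT-1' rows of the table.
cm_small = performance(tp=55, fp=6, tn=57, fn=34)
pbt_1 = performance(tp=130, fp=2, tn=404, fn=7)
```

With the counts from the CM < 85 row, `performance` returns the tabulated 0.618, 0.905, 0.761 and 0.866.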
Application of the newly optimized similarity models to a dataset of 9456 REACH registered substances.
| Model | Similar substances | Similar substances by previous models | 50%–75% confidence | 75%–90% confidence | ≥90% confidence |
|---|---|---|---|---|---|
| CM‐combined¹ | 1060 | ‐³ | 701 | 149 | 210 |
| CM < 85 | 688 | | 466 | 76 | 146 |
| CM ≥ 85 | 372 | | 235 | 73 | 64 |
| R‐combined¹ | 936 | ‐³ | 729 | 98 | 109 |
| R < 85 | 522 | | 376 | 60 | 86 |
| R ≥ 85 | 414 | | 353 | 38 | 23 |
| PBT/vPvB | 53² | 360⁴ | 38 | 13 | 2 |
| ED | 109 | 139⁵ | 86 | 13 | 10 |
| Other | 129² | 554⁴,⁶ | 32 | 46 | 51 |
Note: The confidence bins represent the number of substances that are predicted to be structurally similar to an SVHC with a specific confidence in the structural similarity. The previously used similarity models are described by Wassenaar et al. 1—Combination of two sub‐models. 2—For two chemicals the PubChem fingerprint could not be generated (total = 9454). 3—The CM‐ and R‐models were not adjusted. 4—For one chemical the MACCS fingerprint could not be generated (total = 9455). 5—For 82 chemicals the RDKit equivalent FCFP4‐fingerprint could not be generated (total = 9374). 6—The previously derived PBT/vPvB model was applied to the ‘Other’‐dataset (as the ‘Other’‐SVHCs mainly consist of SVHCs previously included in the PBT/vPvB‐SVHC dataset).
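The binning used in this table can be sketched as a small helper. The bin edges (50%, 75%, 90%) come from the column headers; how boundary values are assigned (here, a value of exactly 75% goes into the 75%–90% bin) and the `<50` label for substances below the lowest bin are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def confidence_bin(confidence_pct: float) -> str:
    """Map a confidence score (in percent) to one of the table's bins."""
    if confidence_pct >= 90:
        return ">=90"
    if confidence_pct >= 75:
        return "75-90"
    if confidence_pct >= 50:
        return "50-75"
    return "<50"  # below the lowest reported bin

def tally(confidences) -> Counter:
    """Count how many substances fall into each confidence bin."""
    return Counter(confidence_bin(value) for value in confidences)

# Illustrative input only: five arbitrary confidence scores.
example = tally([43, 75, 83, 93, 96])
```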
Specific examples of predictions by the PBT/vPvB‐model, including confidence scores in structural similarity.
| ID | Substance with ‘unknown’ properties | Previous most similar known SVHC | Previous model prediction | New most similar known SVHC | New model prediction | New model confidence in structural similarity |
|---|---|---|---|---|---|---|
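| 1 | (structure image) | (structure image) | | (structure image)¹ | Non‐SVHC | 43% |
| 2 | (structure image) | (structure image) | | (structure image) | | 75% |
| 3 | (structure image) | (structure image) | | (structure image) | | 83% |
| 4 | (structure image) | (structure image) | | (structure image) | | 93% |
| 5 | (structure image) | (structure image) | | (structure image) | | 96% |
| 6 | (structure image) | (structure image) | | (structure image) | Non‐SVHC | 23% |
| 7 | (structure image) | (structure image) | | (structure image) | Non‐SVHC | 7% |
| 8 | (structure image) | (structure image) | | (structure image) | Non‐SVHC | 1% |
| 9 | (structure image) | (structure image) | | (structure image) | Non‐SVHC | 0% |
| 10 | (structure image) | (structure image) | | (structure image) | | 75% |
| 11 | (structure image) | (structure image) | | (structure image) | Non‐SVHC | 2% |
| 12 | (structure image) | (structure image) | Non‐SVHC | (structure image) | | 81% |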
Note: The examples illustrate the model's improved consideration of the number of halogenated fragments (examples 1–7), type of halogenated fragments (example 8), and backbone/aromatic structures (examples 9–11). A deficiency of the new model is illustrated with example 12. Similar improvements are observed for the ‘other’‐model, see Supplemental Material S5. 1—This SVHC is the third most similar SVHC; the two more similar SVHCs have comparable similarity values (i.e., all with 43% confidence in structural similarity). This structure is included in the table as an illustrative example in relation to examples 2–5.
FIGURE 3. The ZZS similarity tool main web‐page with the input modes: Single search and batch search (using SMILES and/or CAS‐numbers).