Shivalika Tanwar, Patrick Auberger, Germain Gillet, Mario DiPaola, Katya Tsaioun, Bruno O Villoutreix.
Abstract
Drug discovery often requires the identification of off-targets, as the binding of a compound to targets other than the intended one(s) can be beneficial in some cases and detrimental in others (e.g., binding to anti-targets). Such investigations are also important during the early stage of a project, for example when the target is not known (e.g., phenotypic screening). Target identification can be performed in vitro, but various in silico methods have also been developed in recent years to facilitate target identification and help generate ideas. FastTargetPred is one such approach: it is a freely available Python/C program that attempts to predict putative macromolecular targets (i.e., target fishing) for a single input small-molecule query or an entire compound collection using established chemical similarity search approaches. The putative macromolecular target(s) of a small chemical compound can be predicted by identifying ligands that are known experimentally to bind to some targets and that are structurally similar to the input query compound. This type of target fishing approach therefore relies on a large collection of experimentally validated macromolecule-chemical compound binding data. The small chemical compounds can be described by molecular fingerprints, which encode their structural characteristics as a vector. The published version of FastTargetPred used ligand-target binding data extracted from release 25 (2019) of the ChEMBL database. Here we provide a new dataset for FastTargetPred extracted from the latest ChEMBL release at the time of writing, ChEMBL29 (2021). Four fingerprints (ECFP4, ECFP6, MACCS and PL) were computed for the extracted compound dataset (714,780 unique ChEMBL29 compounds, whereas the entire ChEMBL29 database contained about 2.1 million compounds). However, it was not possible to compute fingerprints for 19 molecules because of their unusual chemistry (complex macrocycles).
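The similarity-based target fishing idea described above can be illustrated with a minimal sketch. It is not FastTargetPred's actual code: the Tanimoto coefficient is assumed here as the similarity measure, fingerprints are modelled as plain sets of "on" bit positions, and the ligand IDs, target IDs and the 0.6 threshold are hypothetical values chosen for illustration only.

```python
# Minimal sketch of similarity-based target fishing (assumptions: Tanimoto
# similarity, fingerprints as sets of on-bit positions, made-up reference data).

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical reference set: ligand ID -> (fingerprint, known target ID).
REFERENCE_LIGANDS = {
    "LIG_A": ({1, 4, 7, 9, 15}, "TARGET_1"),
    "LIG_B": ({2, 4, 8, 16}, "TARGET_2"),
    "LIG_C": ({1, 4, 7, 9, 20}, "TARGET_1"),
}

def predict_targets(query_fp, reference, threshold=0.6):
    """Return {target: best similarity} over reference ligands at or above threshold."""
    hits = {}
    for ligand_id, (fp, target) in reference.items():
        score = tanimoto(query_fp, fp)
        if score >= threshold:
            hits[target] = max(score, hits.get(target, 0.0))
    return hits

query = {1, 4, 7, 9}  # hypothetical query fingerprint
predictions = predict_targets(query, REFERENCE_LIGANDS)
print(predictions)
```

Keeping only the best-scoring ligand per target is one possible design choice; a real tool may instead report every similar ligand along with its annotation.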
These data files were then prepared so as to be compatible with FastTargetPred requirements. The 714,761 ChEMBL chemical compounds with computed fingerprints hit 6,477 macromolecular targets based on the selected criteria. For these ChEMBL compounds, a ChEMBL target ID is reported, and these target IDs were matched with the corresponding UniProt IDs. Thus, when available, the UniProt ID is provided together with the UniProt protein name, the gene name, the organism, annotated involvement in diseases, gene ontology data, and cross-references to the Reactome pathway database. As short peptides can be of interest for drug discovery and chemical biology endeavours, we attempted to predict putative macromolecular targets for a previously reported exhaustive combination of peptides containing four natural amino acids (i.e., 20 × 20 × 20 × 20 = 160,000 linear tetrapeptides) using FastTargetPred and the presently generated ChEMBL29 dataset. With the parameters used, putative targets are reported for 63,944 unique query peptides. These target predictions are provided in two different searchable files with hyperlinks to the ChEMBL, UniProt and Reactome databases.
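The size of the tetrapeptide library follows directly from the combinatorics. The short sketch below enumerates the sequences as one-letter codes, which is an assumption made for readability: the dataset itself stores the peptides as SMILES strings, and the function name `enumerate_peptides` is hypothetical.

```python
from itertools import product

# One-letter codes of the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_peptides(length=4, alphabet=AMINO_ACIDS):
    """Yield every linear peptide sequence of the given length."""
    for residues in product(alphabet, repeat=length):
        yield "".join(residues)

tetrapeptides = list(enumerate_peptides())
print(len(tetrapeptides))  # 20 ** 4 sequences

# Fraction of the library with at least one reported putative target,
# using the count of 63,944 query peptides given in the text.
coverage = 63944 / len(tetrapeptides)
print(f"{coverage:.1%}")
```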
Keywords: Drug discovery; Peptide; Target prediction; Virtual screening
Year: 2022; PMID: 35496477; PMCID: PMC9046614; DOI: 10.1016/j.dib.2022.108159
Source DB: PubMed; Journal: Data Brief; ISSN: 2352-3409
Fig. 1. Chemical space. Comparison of the chemical space covered by the extracted ChEMBL bioactive compounds (orange), tetrapeptides (blue) and approved drugs (magenta) obtained after filtering DrugBank 5.0 (downloaded in December 2021). Of the 2509 approved drugs initially collected, 287 very small compounds (molecules with fewer than 10 heavy atoms; e.g., several compounds have only 1 atom) were removed. In addition, 124 other molecules (unusual chemistry and some mixtures) were deleted. Six physicochemical properties were computed with DataWarrior and a PCA plot was generated. The percentages of variance explained by the first three principal components are PCA1: 81.79%, PCA2: 13.27%, PCA3: 2.52%.
| Subject | Drug Discovery |
| Specific subject area | Macromolecular target predictions for input small chemical compounds, Target Fishing. |
| Type of data | Text files (CSV, TXT) |
| How the data were acquired | Chemical compounds and corresponding macromolecular targets were extracted from the ChEMBL29 database. 160,000 linear tetrapeptides (SMILES strings) were downloaded from |
| Data format | Raw and processed data: Extracted ChEMBL compounds (canonical SMILES and ChEMBL compound IDs) in TXT format. |
| Description of data collection | The SQLite version of the ChEMBL29 database was downloaded. SQLite terminal shell commands were applied to extract an initial data file (this raw data file is provided in the extra_data folder). Canonical SMILES strings were extracted for small molecules associated with a biological assay involving a single protein or a protein complex, with the assay type set to “binding”. Only molecules with experimental potency/affinity/activity data (pChEMBL_value) corresponding to 20 µM or less were selected. This initial raw file contained 1,412,822 compounds. Additional filtering involved selecting molecules with a ChEMBL confidence_score of 6 or above, manual curation of mixtures and removal of compounds with obvious errors. Salts were removed with MayaChemTools. The resulting filtered file, containing 714,780 unique ChEMBL29 bioactive compounds, is provided (available in the extra_data folder). |
| Data source location | • Inserm U1141, Hospital Robert Debre, 75019 |
| Data accessibility | The data are freely available on the Zenodo open access platform and accessible via the following link: |
| Related research article | L. Chaput, V. Guillaume, N. Singh, B. Deprez, B.O. Villoutreix, FastTargetPred: a program enabling the fast prediction of putative protein targets for input chemical databases, Bioinformatics 36 (2020) 4225–4226. |
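The potency and confidence criteria given in the description of data collection can be sketched as a simple filter. This is an illustration under stated assumptions, not the SQLite commands actually used: pChEMBL is taken as the negative base-10 logarithm of the molar potency, so that 20 µM corresponds to a value of about 4.70; the two filtering stages are merged into one step; and the records and compound IDs are made up.

```python
import math

def pchembl_from_micromolar(value_um):
    """Convert a potency in micromolar to the -log10(molar) scale (assumed pChEMBL definition)."""
    return -math.log10(value_um * 1e-6)

PCHEMBL_CUTOFF = pchembl_from_micromolar(20.0)  # about 4.70
MIN_CONFIDENCE = 6                              # confidence_score threshold from the text

# Hypothetical activity records; field names mirror the terms used in the text.
records = [
    {"compound": "CPD_1", "pchembl_value": 7.2, "confidence_score": 9},
    {"compound": "CPD_2", "pchembl_value": 4.2, "confidence_score": 9},  # weaker than 20 uM
    {"compound": "CPD_3", "pchembl_value": 6.0, "confidence_score": 4},  # low confidence
    {"compound": "CPD_1", "pchembl_value": 6.5, "confidence_score": 8},  # repeated compound
]

def select_compounds(rows, cutoff=PCHEMBL_CUTOFF, min_confidence=MIN_CONFIDENCE):
    """Return the sorted unique compound IDs passing both thresholds."""
    kept = {
        row["compound"]
        for row in rows
        if row["pchembl_value"] >= cutoff and row["confidence_score"] >= min_confidence
    }
    return sorted(kept)

selected = select_compounds(records)
print(round(PCHEMBL_CUTOFF, 3), selected)
```

A higher pChEMBL value means a more potent compound, which is why "20 µM or less" translates into a lower bound on the pChEMBL scale.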