Break Down in Order To Build Up: Decomposing Small Molecules for Fragment-Based Drug Design with eMolFrag.

Tairan Liu, Misagh Naderi, Chris Alvin, Supratik Mukhopadhyay, Michal Brylinski.

Abstract

Constructing high-quality libraries of molecular building blocks is essential for successful fragment-based drug discovery. In this communication, we describe eMolFrag, new open-source software that decomposes organic compounds into nonredundant fragments retaining molecular connectivity information. Given a collection of molecules, eMolFrag generates a set of unique fragments comprising larger moieties, bricks, and smaller linkers connecting bricks. These building blocks can subsequently be used to construct virtual screening libraries for targeted drug discovery. The robustness and computational performance of eMolFrag are assessed against the Directory of Useful Decoys, Enhanced database in serial and parallel modes with up to 16 computing cores. Further, the application of eMolFrag in de novo drug design is illustrated using the adenosine receptor. eMolFrag is implemented in Python, and it is available as stand-alone software and a web server at www.brylinski.org/emolfrag and https://github.com/liutairan/eMolFrag.

Year: 2017 PMID: 28346786 PMCID: PMC5433162 DOI: 10.1021/acs.jcim.6b00596

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Introduction

Hit identification, lead generation, and lead optimization are the key steps at the outset of a drug discovery process. Briefly, compounds showing promising activity identified by high-throughput screening as initial hits are filtered and modified to generate lead compounds, which satisfy basic drug-likeness properties.[1] These lead compounds are further optimized to enhance the potency toward the target protein as well as to reduce their nonselectivity and toxicity.[2] Conventional hit identification is not only limited to already synthesized compounds, often leading to low discovery rates, but it is also expensive and requires time-consuming screening experiments.[3] Consequently, virtual screening that can rapidly evaluate millions of compounds has become an integral part of lead identification protocols.[4] In order to enhance the chemical diversity of virtual screening libraries, large collections of drug-like compounds can be generated through combinatorial chemistry.[5] Since constructing and screening the entire chemical space is not feasible even with the most advanced computers, building extensive yet targeted libraries is critical for the success of virtual screening. A number of fragment- and atom-based techniques have been developed to generate novel chemical compounds for virtual screening, including binding-site point connection methods (LUDI[6]), fragment connection methods (LEA3D,[7] LigBuilder,[8] and eSynth[9]), sequential build-up algorithms (LEGEND[10] and SPROUT[11]), and random connection techniques (CoG[12] and Flux[13]). These de novo methods require an initial set of building blocks or molecular fragments, which ultimately control the properties of the resulting screening compounds and their affinity toward the target protein. Consequently, there is great interest in efficient fragmentation techniques to generate sets of chemically feasible building blocks for the subsequent molecular synthesis.
Retrosynthetic combinatorial analysis procedure (RECAP[14]) and breaking retrosynthetically interesting chemical substructures (BRICS[15]) are examples of systematic fragmentation methods. In RECAP, compounds are dissected based on a set of 11 bond types, following simple rules such as leaving cyclic bonds and alkyl groups smaller than five carbons intact. These rules ensure that major structural features of organic compounds, such as ring motifs, are preserved. BRICS expands the bond type criteria used by RECAP from 11 to 16 taking into account the chemical environment of each bond type and the surrounding substructures. Additional filters are also applied in order to prevent generating small and unwanted fragments. Other methods extract and classify chemical scaffolds by pruning side chains and removing peripheral ring moieties.[16] In general, the performance of fragment-based chemical synthesis tools such as eSynth,[9] CONFIRM,[17] and AutoGrow[18] could significantly be improved by employing building blocks annotated with empirical connectivity patterns. Although this information could help explore pharmacologically relevant regions of the diverse chemical space,[9] many existing fragmentation tools, e.g. Fragmenter[19] and molBLOCKS,[20] do not consider the chemical context of the fragments. In other words, the connectivity information on a fragment is not stored while extracting building blocks. To address this issue, we developed eMolFrag, a new open-source molecular fragmentation software. eMolFrag decomposes either a single ligand or a library of compounds into two types of chemical building blocks, bricks and the connecting linkers. The resulting complete and nonredundant sets of building blocks are annotated with the comprehensive connectivity information in order to facilitate the construction of novel compounds with combinatorial synthesis software. 
eMolFrag has been parallelized to decrease the computing time required to analyze large collections of molecules.

Methods

eMolFrag employs a graph-based notation, where molecules are sets of nodes representing atoms connected by edges corresponding to chemical bonds. A fragment is a substructure that contains all or only some of the atoms and bonds of a given molecule; fragments are categorized as either bricks or linkers. Given a collection of molecules, the complete set of unique fragments is constructed in two steps shown in Figure 1. The first step, labeled as Part I, involves creating an initial set of fragments, whereas the second step, labeled as Part II, guarantees the uniqueness of the resulting set of fragments.

Figure 1

Flowchart of eMolFrag. Part I: Input molecules are fragmented with the BRICS algorithm to generate a complete set of building blocks. Part II: Fragment redundancy is removed according to pairwise chemical alignments with the kcombu program. At the end, nonredundant sets of bricks and linkers are reported along with the consolidated connectivity information as well as lists of similar fragments that were removed.

Part I: Fragmentation

In eMolFrag, a set of molecules is first decomposed into constituent fragments with the BRICS algorithm,[15] implemented in RDKit.[21] Chemical compounds are broken down into larger moieties called bricks connected by linkers based on 16 chemical environments defined by the BRICS model;[15] pseudocode for the fragmentation process is given in the Supporting Information (Algorithm S1). A brick fragment is a molecular construct having at least four non-hydrogen atoms. Subsequently, bricks are removed from a molecule and the remaining fragments are classified as linkers (see Algorithm S2 in the Supporting Information). Broken bonds are replaced by dummy atoms, which are placeholders for those atoms removed from a particular bond. The complete information, including the type of atoms involved in those bonds that were broken, is stored for each brick in order to provide empirical connectivity patterns. Linkers have different auxiliary connectivity information, i.e. these fragments are annotated only with the maximum number of bonds at various positions. Examples of bricks and linkers are provided in the Supporting Information (Examples S1 and S2, respectively). We found that this approach allows the efficient construction of series of new molecules whose chemistry is similar to that of parent compounds.
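The bookkeeping behind the brick and linker split can be illustrated on a toy molecular graph. The sketch below is not eMolFrag's code and does not apply the BRICS rules: the bonds to cut are supplied by hand (an assumption), and `split_into_fragments` and the data layout are hypothetical names chosen for illustration. It cuts the listed bonds, collects connected components, and labels a component a brick when it has at least four non-hydrogen atoms. Bricks keep the element of the atom lost at each cut (the role played by dummy atoms), whereas linkers keep only a count of open positions per atom.

```python
from collections import defaultdict

BRICK_MIN_HEAVY_ATOMS = 4  # threshold stated in the text


def split_into_fragments(atoms, bonds, cleavable):
    """atoms: {index: element}; bonds: [(i, j)]; cleavable: bonds to cut (hand-picked here)."""
    cut = {frozenset(pair) for pair in cleavable}
    adjacency = defaultdict(set)
    for i, j in bonds:
        if frozenset((i, j)) not in cut:
            adjacency[i].add(j)
            adjacency[j].add(i)

    fragments, seen = [], set()
    for start in sorted(atoms):
        if start in seen:
            continue
        # Depth-first search for the connected component containing `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component

        # Each cut bond leaves an open position on the atom that stays in this
        # fragment; record the element of the partner that was cut away.
        open_positions = []
        for pair in cut:
            i, j = sorted(pair)
            if i in component and j not in component:
                open_positions.append((i, atoms[j]))
            elif j in component and i not in component:
                open_positions.append((j, atoms[i]))
        open_positions.sort()

        heavy = sum(1 for a in component if atoms[a] != "H")
        if heavy >= BRICK_MIN_HEAVY_ATOMS:
            fragment = {"kind": "brick", "atoms": sorted(component),
                        "attachments": open_positions}
        else:
            counts = defaultdict(int)
            for atom, _ in open_positions:
                counts[atom] += 1
            fragment = {"kind": "linker", "atoms": sorted(component),
                        "max_bonds": dict(counts)}
        fragments.append(fragment)
    return fragments


# Toy input: a six-membered carbon ring (0-5), an oxygen (6), and a C-C-C-N chain (7-10).
atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C",
         6: "O", 7: "C", 8: "C", 9: "C", 10: "N"}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (0, 6), (6, 7), (7, 8), (8, 9), (9, 10)]
fragments = split_into_fragments(atoms, bonds, cleavable=[(0, 6), (6, 7)])
for fragment in fragments:
    print(fragment)
```

In this toy case the ring and the four-atom chain come out as bricks annotated with the oxygen they were attached to, and the lone oxygen becomes a linker with two open positions.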

Part II: Mitigation of Fragment Redundancy

Since one of the objectives of an effective fragmentation procedure is to employ the resulting fragments in a synthesis procedure, the cardinality of the final set of fragments is critical. On that account, eMolFrag attempts to minimize the size of sets of bricks and linkers by removing redundancy with a partitioning and sieve-based removal scheme presented in the Supporting Information (Algorithm S3). Two fragments are equivalent if the Tanimoto coefficient (TC) calculated for topologically constrained maximum common substructures by the kcombu program[22] is equal to 1.0. Information on equivalent atoms provided by kcombu as well as their connectivity information is then used to consolidate identical fragments into a single, unique construct.
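The consolidation step can be sketched as follows. This is a simplified illustration, not Algorithm S3: a precomputed identity key stands in for the kcombu alignment with TC = 1.0, a dictionary lookup replaces the partitioning and sieve-based scheme, and attachment positions are assumed to be already mapped onto a common numbering (in eMolFrag the atom equivalences come from kcombu). All names and the toy data are hypothetical.

```python
def consolidate(fragments):
    """Merge equivalent fragments and pool their connectivity information.

    fragments: list of dicts with 'name', 'key' (stand-in for structural identity),
    and 'attachments' mapping a position to the set of atom types seen there.
    Returns the unique fragments and, for each kept fragment, the names removed.
    """
    unique, removed = {}, {}
    for frag in fragments:
        kept = unique.get(frag["key"])
        if kept is None:
            # First occurrence: keep a copy so the input is not modified.
            unique[frag["key"]] = {
                "name": frag["name"],
                "key": frag["key"],
                "attachments": {pos: set(types)
                                for pos, types in frag["attachments"].items()},
            }
            removed[frag["name"]] = []
        else:
            # Duplicate: fold its connectivity into the kept fragment.
            for pos, types in frag["attachments"].items():
                kept["attachments"].setdefault(pos, set()).update(types)
            removed[kept["name"]].append(frag["name"])
    return list(unique.values()), removed


# Toy input: the same ring found in two donors with different neighbors,
# plus one unrelated fragment. Keys are illustrative strings.
toy = [
    {"name": "donor1_ring", "key": "c1ccccc1", "attachments": {0: {"N"}}},
    {"name": "donor2_ring", "key": "c1ccccc1", "attachments": {0: {"O"}, 3: {"C"}}},
    {"name": "donor2_het", "key": "c1ccncc1", "attachments": {0: {"C"}}},
]
unique, removed = consolidate(toy)
print(unique)
print(removed)
```

The kept ring carries the union of the attachment types observed across donors, and the `removed` mapping plays the role of the lists of similar fragments that were removed.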

Results and Discussion

Benchmarks against the DUD-E Database

We validate the eMolFrag algorithm by conducting a self-reconstruction test as described previously.[9] Briefly, given an input molecule m, a set of fragments extracted from m by eMolFrag is passed to a fragment-based construction procedure with eSynth employing its chemical rules.[9] A molecule with the highest chemical similarity to m measured by the TC calculated for Daylight fingerprints[23] is selected from a series of compounds constructed by eSynth. Here, we employ a fingerprint-based assessment of chemical similarity with OpenBabel[24] because this technique is computationally much faster than kcombu. A TC of ≥0.8 indicates that a molecule highly similar to m was generated, whereas a TC of 1.0 indicates that compound m has been reconstructed. As a testing set, we use 20 408 active compounds for 102 protein targets from the Directory of Useful Decoys, Enhanced (DUD-E) database[25] covering a diverse chemical space of pharmacologically relevant molecules. The performance of eMolFrag is compared to molBLOCKS,[20] another fragmentation software employing the RECAP algorithm.[14] Figure 2 shows a two-way box plot of the number of atoms per fragment and the number of fragments per molecule for these two programs. Fragments generated by molBLOCKS typically contain 6–10 (the default protocol) and 6–11 (an extensive mode) atoms, whereas most fragments extracted by eMolFrag consist of 2–7 atoms. The median numbers of fragments per molecule are 3, 8, and 6 for molBLOCKS (default), molBLOCKS (extensive), and eMolFrag, respectively. Finally, molecular synthesis with eSynth[9] was conducted employing fragments generated by eMolFrag for active compounds in the DUD-E database. Encouragingly, 82.8% of active compounds were reconstructed with a TC of 1.0 and 92.2% with a TC of ≥0.8.
An inspection of the failed cases revealed that the main reason a relatively small fraction of testing compounds was not generated is that the synthesis software does not allow two bricks to be connected directly. Overall, the self-reconstruction benchmarking results demonstrate that eMolFrag properly extracts molecular fragments providing sufficient connectivity information to rebuild the majority of parent molecules.
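The scoring used in the self-reconstruction test can be sketched with the Tanimoto coefficient computed on fingerprints represented as sets of on-bits. The fingerprints below are toy integer sets, not Daylight or OpenBabel fingerprints, and the function names and outcome labels are hypothetical; only the 0.8 and 1.0 thresholds come from the text.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Two empty fingerprints are treated as identical (a convention chosen here).
    return len(a & b) / len(union) if union else 1.0


def best_match(target_bits, candidates):
    """Return (name, TC) of the candidate most similar to the target."""
    return max(((name, tanimoto(target_bits, bits)) for name, bits in candidates.items()),
               key=lambda item: item[1])


def classify(tc):
    """Label the outcome with the thresholds used in the benchmark."""
    if tc == 1.0:
        return "reconstructed"
    if tc >= 0.8:
        return "highly similar"
    return "not reconstructed"


# Toy data: on-bit sets for a target molecule and three constructed compounds.
target = {1, 2, 3, 4, 5}
constructed = {"cmpd_a": {1, 2, 3, 4}, "cmpd_b": {1, 2, 3, 4, 5}, "cmpd_c": {1, 2, 9}}
name, tc = best_match(target, constructed)
print(name, tc, classify(tc))
```

Repeating this for every input molecule and counting the labels would give the two success rates reported for the benchmark.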

Figure 2

Two-way box plot of the number of fragments per molecule against the number of atoms per fragment. Bioactive compounds from the Directory of Useful Decoys, Enhanced database were fragmented with eMolFrag (green) and molBLOCKS (red: the default protocol, gray: an extensive mode).

Computational Performance

Decomposing large compound libraries can be time-consuming depending on the number of input molecules; therefore, we parallelized the eMolFrag code. The serial and parallel performance of eMolFrag is assessed by fragmenting subsets of DUD-E actives with sizes varying from 100 to 12 800 molecules. All tests were performed on a machine equipped with two 2.6 GHz 8-core Sandy Bridge Xeon 64-bit processors, 32GB 1666 MHz RAM and 500GB HD, running Red Hat Enterprise Linux 6. Figure 3 shows that the wall time for eMolFrag scales nearly linearly with the number of input molecules. The average processing speed of the serial code ranges from 8.7 molecules/s for the smallest data set to 4.8 molecules/s for the largest data set (see Table S1 in the Supporting Information). The actual decomposition speed (Part I in Figure 1) is faster for larger sets because the I/O overhead is reduced by efficient data caching. However, removing redundancy (Part II in Figure 1) from large data sets requires significantly longer computing times compared to small data sets, which in turn causes the overall speed to decrease with the increasing number of input molecules. Without removing redundancy, the average processing speed of serial eMolFrag is 9.8 molecules/s for the smallest data set and 23.2 molecules/s for the largest data set. For comparison, a serial version of molBLOCKS, which does not remove redundancy, is capable of processing 6.6 and 12.5 molecules/s for the smallest and the largest data sets, respectively. Thus, eMolFrag is 1.5–1.9× faster than molBLOCKS. Further, algorithms implemented in eMolFrag are polynomial in complexity; the best-fit curves in Figure 3 are $y = 0.022x^{1.238}$ ($R^2 = 0.99989$) for serial and $y = 0.013x^{1.201}$ ($R^2 = 0.99989$) for parallel execution. This near-linear scaling gives empirical evidence of the efficient implementation of eMolFrag.
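The reported best-fit curves can be evaluated directly to see what near-linear scaling means in practice. The sketch below assumes that x is the number of input molecules and that y is the wall time in seconds; the unit is an assumption, and the function names are hypothetical. It computes the predicted throughput for the largest benchmark set and the factor by which the wall time grows when the input size doubles.

```python
# Best-fit curves quoted in the text: wall time y = a * x**b for x input molecules.
# Assumption: y is expressed in seconds.
FITS = {"serial": (0.022, 1.238), "parallel (16 cores)": (0.013, 1.201)}


def wall_time(n_molecules, a, b):
    """Predicted wall time for n_molecules under the power-law fit."""
    return a * n_molecules ** b


def doubling_factor(b):
    """Factor by which the wall time grows when the input size doubles."""
    return 2.0 ** b


throughput = {}
for mode, (a, b) in FITS.items():
    seconds = wall_time(12800, a, b)
    throughput[mode] = 12800 / seconds
    print(f"{mode}: {seconds:.0f} s for 12800 molecules, "
          f"{throughput[mode]:.1f} molecules/s, "
          f"x{doubling_factor(b):.2f} wall time per doubling of the input")
```

Under that assumption the fitted curves give throughputs close to the values quoted for the largest data set. Exactly linear scaling would correspond to a doubling factor of 2.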

Figure 3

Serial and parallel performance of eMolFrag. The main graph shows the wall time for the complete fragmentation procedure plotted against the number of input molecules. A serial code is compared to the parallel processing on 16 computing cores. Parallel scaling for a fixed size input data set of 3200 molecules is presented as the inset. Upper and lower bounds for the ideal speedup calculated according to Amdahl’s law are shown as dark and light gray lines, respectively.

The impact of the number of computing cores on parallel processing is assessed by comparing the performance of parallel eMolFrag to the theoretical speedup estimated with Amdahl’s law.[26] The inset in Figure 3 shows that executing eMolFrag in parallel for a fixed input data set of 3200 molecules and the number of computing cores varying from 1 to 16 roughly corresponds to a hypothetical code consisting of 47–60% parallel calculations. Note that eMolFrag does not conform exactly to Amdahl’s law because the workload related to removing redundancy (Part II in Figure 1) is unevenly distributed across computing cores. Although the total execution time of eMolFrag diverges from Amdahl’s law, the parallel processing is faster than the serial execution. The average processing speed for the parallel code running on 16 computing cores ranges from 24 molecules per second for the smallest data set to 11.8 molecules per second for the largest data set (see Table S1 in the Supporting Information). This shorter processing time for parallel eMolFrag becomes particularly beneficial for larger data sets. For instance, decomposing 20 408 active compounds from the DUD-E data set for the self-benchmarking test takes 1 h and 18 min on a single core compared to only half an hour on 16 computing cores.
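For reference, the speedup bounds implied by Amdahl's law can be computed in a few lines. The formula is the standard one, S(n) = 1 / ((1 - p) + p / n) for a parallel fraction p on n cores; the 47% and 60% fractions are the values quoted in the text, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup for a code whose parallelizable share is parallel_fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


# Lower and upper bounds quoted in the text: 47% and 60% parallel calculations.
for cores in (1, 2, 4, 8, 16):
    low = amdahl_speedup(0.47, cores)
    high = amdahl_speedup(0.60, cores)
    print(f"{cores:2d} cores: speedup between {low:.2f}x and {high:.2f}x")
```

With these fractions the ideal speedup on 16 cores lies between roughly 1.8× and 2.3×, and it can never exceed 1 / (1 - p), about 1.9× and 2.5×, regardless of the number of cores.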

Application to Antagonists of the Adenosine Receptor

To illustrate the application of eMolFrag in de novo drug discovery, we show that bioactive compounds can successfully be constructed from molecular fragments extracted from chemically dissimilar binders of the same target protein. Here, we selected the human adenosine A2a receptor (AA2AR), a member of the G protein-coupled receptor (GPCR) superfamily containing targets for about 27% of all FDA-approved drugs.[27] Figure 4 presents individual steps of the cross-validation procedure, in which CHEMBL144979, a known bioactive ligand for AA2AR,[28] is the target molecule. Four other AA2AR antagonists, called donors, are shown in Figure 4A. Since the chemical similarity of donors to the target, measured by the TC reported by kcombu, is lower than 0.5, CHEMBL144979 can be considered novel with respect to the donor molecules.

Figure 4

Example of the successful construction of a bioactive of the adenosine receptor by eMolFrag and eSynth. (A) Donor molecules with the chemical similarity to CHEMBL144979 measured by the Tanimoto coefficient (TC). (B) Bricks annotated with the list of atom types that can be attached at various positions. (C) Linkers annotated with the number of the maximum allowed connections. (D) Examples of new molecules synthesized using bricks and linkers. The first molecule shown in a box is a known bioactive of the adenosine receptor. Highlighted in different colors are essential building blocks to generate CHEMBL144979 that are extracted from donor molecules by eMolFrag and used in molecular synthesis by eSynth. Further, the connectivity information inferred from donors that is required to correctly assemble CHEMBL144979 is highlighted in bold in B and C.

Unique sets of 10 bricks and 7 linkers extracted by eMolFrag from 4 donors are shown in Figures 4B and 4C, respectively. For instance, the triazolo-quinazoline fragment highlighted in pink carrying the chlorine moiety was obtained from CHEMBL95229. This compound is a member of a series of pyrazolo-triazolo-pyrimidines with subnanomolar affinity against ARs created via N5-phenylcarbamoyl substitutions.[29] Bricks contain information on atom types that can be attached at various positions (small boxes in Figure 4B), whereas linkers are annotated with the maximum number of allowed bonds (small circles in Figure 4C). The sets of bricks and linkers are complete and nonredundant, i.e. each unique fragment carries the connectivity information extracted from multiple donor compounds. For example, the connectivity information for a benzene ring, which is present in all donors, is consolidated by eMolFrag into a single fragment shown in cyan in Figure 4B. Subsequently, molecular fragments extracted by eMolFrag were passed to eSynth[9] in order to generate a series of compounds.
A serial version of eSynth produced 4 492 609 virtual compounds in 12 h. Encouragingly, the first compound in Figure 4D (shown in a box) is CHEMBL144979; therefore, the target molecule has been successfully constructed. Further, the set of virtual molecules comprises 845 compounds whose TC to CHEMBL144979 is ≥0.7 and as many as 239 656 molecules with a TC of ≥0.5. Three randomly selected virtual molecules are presented in Figure 4D to demonstrate the chemical diversity of compounds generated by eSynth. It is important to note that these retrospective cross-validation benchmarks are designed to mimic real applications by attempting to construct target molecules using building blocks extracted from chemically dissimilar compounds. This case study demonstrates that high-quality fragment sets generated by eMolFrag can be used in fragment-based drug discovery to create targeted screening libraries likely containing novel bioactives.

Conclusions

eMolFrag is a fast and robust tool to extract molecular fragments, classified as bricks and linkers, from small molecule data sets. Subsequently, these fragments can be used to construct targeted libraries for virtual screening. A unique feature of eMolFrag is that it stores the connectivity information for the extracted building blocks to help generate new series of chemically feasible compounds. Although eMolFrag was optimized to work with eSynth, a recently developed molecular synthesis algorithm, it can also be integrated into other cheminformatics toolkits utilizing chemical fragments. eMolFrag is freely available as stand-alone software and a Web server at www.brylinski.org/emolfrag and https://github.com/liutairan/eMolFrag.

1. Virtual screening: a real screening complement to high-throughput screening.

Authors: J Mestres
Journal: Biochem Soc Trans Date: 2002-08 Impact factor: 5.407

2. A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules.

Authors: Nathan Brown; Ben McKay; François Gilardoni; Johann Gasteiger
Journal: J Chem Inf Comput Sci Date: 2004 May-Jun

Review 3. Navigating chemical space for biology and medicine.

Authors: Christopher Lipinski; Andrew Hopkins
Journal: Nature Date: 2004-12-16 Impact factor: 49.962

4. LEA3D: a computer-aided ligand design for structure-based drug design.

Authors: Dominique Douguet; Hélène Munier-Lehmann; Gilles Labesse; Sylvie Pochet
Journal: J Med Chem Date: 2005-04-07 Impact factor: 7.446

Review 5. How many drug targets are there?

Authors: John P Overington; Bissan Al-Lazikani; Andrew L Hopkins
Journal: Nat Rev Drug Discov Date: 2006-12 Impact factor: 84.694

6. Flux (2): comparison of molecular mutation and crossover operators for ligand-based de novo design.

Authors: Uli Fechner; Gisbert Schneider
Journal: J Chem Inf Model Date: 2007-02-23 Impact factor: 4.956

7. Derivatives of the triazoloquinazoline adenosine antagonist (CGS 15943) having high potency at the human A2B and A3 receptor subtypes.

Authors: Y C Kim; M de Zwart; L Chang; S Moro; J K von Frijtag Drabbe Künzel; N Melman; A P IJzerman; K A Jacobson
Journal: J Med Chem Date: 1998-07-16 Impact factor: 7.446

8. Pyrazolo[4,3-e]1,2,4-triazolo[1,5-c]pyrimidine derivatives as highly potent and selective human A(3) adenosine receptor antagonists: influence of the chain at the N(8) pyrazole nitrogen.

Authors: P G Baraldi; B Cacciari; R Romagnoli; G Spalluto; S Moro; K N Klotz; E Leung; K Varani; S Gessi; S Merighi; P A Borea
Journal: J Med Chem Date: 2000-12-14 Impact factor: 7.446

9. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

Authors: Michael M Mysinger; Michael Carchia; John J Irwin; Brian K Shoichet
Journal: J Med Chem Date: 2012-07-05 Impact factor: 7.446

10. molBLOCKS: decomposing small molecule sets and uncovering enriched fragments.

Authors: Dario Ghersi; Mona Singh
Journal: Bioinformatics Date: 2014-03-28 Impact factor: 6.937
