Navigating the chemical space of dipeptidyl peptidase-4 inhibitors.

Watshara Shoombuatong1, Veda Prachayasittikul2, Nuttapat Anuwongcharoen1, Napat Songtawee1, Teerawat Monnor1, Supaluk Prachayasittikul1, Virapong Prachayasittikul3, Chanin Nantasenamat2.   

Abstract

This study represents the first large-scale study on the chemical space of inhibitors of dipeptidyl peptidase-4 (DPP4), which is a potential therapeutic protein target for the treatment of diabetes mellitus. Herein, a large set of 2,937 compounds evaluated for their ability to inhibit DPP4 was compiled from the literature. Molecular descriptors were generated from the geometrically optimized low-energy conformers of these compounds at the semiempirical AM1 level. The origins of DPP4 inhibitory activity were elucidated from computed molecular descriptors that accounted for the unique physicochemical properties inherently present in the active and inactive sets of compounds as defined by their respective half maximal inhibitory concentration values of less than 1 μM and greater than 10 μM, respectively. Decision tree analysis revealed the importance of molecular weight, total energy of a molecule, topological polar surface area, lowest unoccupied molecular orbital, and number of hydrogen-bond donors, which correspond to molecular size, energy, surface polarity, electron acceptors, and hydrogen bond donors, respectively. The prediction model was subjected to rigorous independent testing via three external sets. Scaffold and chemical fragment analysis was also performed on these active and inactive sets of compounds to shed light on the distinguishing features of the functional moieties. Docking of representative active DPP4 inhibitors was also performed to unravel key interacting residues. The results of this study are anticipated to be useful in guiding the rational design of novel and robust DPP4 inhibitors for the treatment of diabetes.

Keywords:  QSAR; antidiabetic; decision tree; fragment analysis; molecular docking; rational drug design; scaffold analysis

Year:  2015        PMID: 26309399      PMCID: PMC4539085          DOI: 10.2147/DDDT.S86529

Source DB:  PubMed          Journal:  Drug Des Devel Ther        ISSN: 1177-8881            Impact factor:   4.162


Introduction

Diabetes is a chronic disease and a major public health concern with an estimated global prevalence of 285 million.1 In the United States, 29.1 million people (approximately 9.3% of the population) have diabetes, of whom 21 million are diagnosed and 8.1 million are undiagnosed.2 The estimated economic cost of diagnosed diabetes in the United States was $245 billion in 2012, up from $174 billion in 2007.3 Given the multifaceted nature of diabetes, the search for robust drugs has been reported to entail a multitude of molecular targets.4,5 Dipeptidyl peptidase-4 (DPP4) has emerged as a promising therapeutic target for the treatment of type 2 diabetes (T2D) because it regulates glucose homeostasis.6 DPP4 is a serine protease that cleaves two endogenous incretin hormones, glucagon-like peptide-1 and glucose-dependent insulinotropic polypeptide. Upon food ingestion, intestinal cells secrete these incretin hormones, which target pancreatic β-cells to stimulate insulin release. Generally, these two hormones strongly reduce blood glucose concentration; however, their rapid degradation by DPP4 in T2D results in persistently high glucose levels.7 Therefore, the inhibition of DPP4 reduces blood glucose by preventing the degradation of these incretin hormones. Several DPP4 inhibitors have been released on the market, beginning with sitagliptin in 2006, followed by vildagliptin in 2007, saxagliptin in 2009, alogliptin in 2010, linagliptin in 2011, and teneligliptin in 2012.8 Generally, DPP4 inhibitors are considered to afford a favorable safety profile,9,10 although rare side effects (ie, angioedema, hemolysis, leucopenia, rheumatoid arthritis, and drug-induced acute hepatic injury) have been documented at low incidence.11 Thus, there remains room for improvement of the inhibitory and pharmacokinetic properties of DPP4 inhibitors.
Medicinal chemistry approaches have been instrumental in the development of DPP4 inhibitors by facilitating the investigation of substituent effects in the quest for improved potency.8,12 Complementing the effort of medicinal chemistry is computer-aided drug design, of which chemical space exploration and quantitative structure–activity relationship (QSAR) methods are employed in this study. The former entails exploration of the chemical space to gain insights on the molecular complexity of investigated compounds. The latter enables the correlation of molecular structure with its respective biological activity via multivariate learning methods.13,14 The availability of public databases of bioactivity significantly lowers the barriers for large-scale investigation of the structure–activity relationship for compounds of interest15,16 and leads to accelerated drug discovery efforts. This study takes advantage of bioactivity data compilation of DPP4 inhibitors available from the BindingDB.17 To the best of our knowledge, this study represents the first large-scale chemical space exploration and QSAR investigation of DPP4 inhibitory activity. Chemical space exploration was achieved by exploratory data analysis, cluster analysis, and chemical substructure analysis, whereas QSAR analysis was performed using decision tree (DT) analysis. A schematic representation of the computational workflow is summarized in Figure 1.
Figure 1

Schematic representation of the computational workflow.

Abbreviation: QSAR, quantitative structure–activity relationship.

Material and methods

Compilation of the dataset

A large compilation of known compounds with inhibitory activity against DPP4 was extracted from the BindingDB,17 drawing on 138 original articles. This nonredundant dataset comprises 2,937 compounds with the associated bioactivity reported as half maximal inhibitory concentration (IC50) values. Compounds with IC50 ≤1 μM were categorized as “actives” and those with IC50 ≥10 μM as “inactives”, which resulted in subsets of 2,075 and 534 compounds, respectively. The remaining 328 compounds, which exhibited intermediate bioactivity, were excluded from this study because their class membership is ambiguous, and the subset of 2,609 compounds was subjected to further investigation. The imbalance between the active and inactive classes was addressed by subjecting the 2,075 actives to fuzzy C-means clustering,18 which produced a final dataset of 588 actives and 534 inactives (DPP4-TRN). The constructed predictive model was validated against three external validation sets. To assess the ability of the predictive model to filter out inactives, these three sets were employed as negative controls and were compiled from the BindingDB as follows: 1) random selection of active and inactive inhibitors against a wide range of human target proteins (DPP4-TEST1); 2) random selection of active and inactive inhibitors against other human proteases (DPP4-TEST2); and 3) random selection of active and inactive inhibitors against other human DPP types such as DPP1, DPP2, and DPP7 (DPP4-TEST3). According to the applicability domain concept, a QSAR model predicts reliably only for compounds belonging to chemotypes similar to those used as training data for constructing the model.19 Thus, the applicability domain was taken into account when selecting compounds for the external validation sets.
The Tanimoto coefficient is a commonly used metric for measuring the similarity between compounds of the internal and external sets; it varies between 0 (total lack of similarity) and 1 (a compound from the internal set is identical to a compound in the external set). Herein, the average Tanimoto coefficient value was used as the cutoff for selecting compounds to include in the external validation sets.20–22 After this filtering, DPP4-TEST1, DPP4-TEST2, and DPP4-TEST3 consisted of 149, 160, and 167 compounds, respectively.
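As an illustration of the similarity measure described above, the following minimal Python sketch computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions. The fingerprints, the function names, and the mean-similarity helper are hypothetical; the fingerprint type and the exact selection rule used in the study are not specified in the text.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(a | b)


def mean_similarity(query: set, training: list) -> float:
    """Average Tanimoto similarity of one external compound to the training set."""
    return sum(tanimoto(query, t) for t in training) / len(training)


# Hypothetical bit sets standing in for real molecular fingerprints.
training_set = [{1, 2, 3, 4}, {2, 3, 5}]
external_compound = {1, 2, 3}

print(tanimoto(external_compound, training_set[0]))
print(mean_similarity(external_compound, training_set))
```

A value of 0 means the two fingerprints share no bits, and a value of 1 means they are identical, which matches the range stated above.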

Calculation of molecular descriptors

The molecular structures of the investigated compounds were converted to three-dimensional structures from their simplified molecular-input line-entry system (SMILES) notation using MarvinSketch, version 6.2.1, from ChemAxon (ChemAxon Ltd., Budapest, Hungary).23 The structures were then converted to the appropriate file format using Babel, version 3.3,24 for subsequent geometry optimization at the semiempirical AM1 level in Gaussian 09.25 Our previous chemical space exploration of aromatase inhibitors was performed using a set of 13 descriptors selected to represent the general properties of a molecule.26 Because these descriptors are readily interpretable, the same set was employed in this investigation. This set of descriptors included the following: 1) mean absolute charge (Qm); 2) energy; 3) dipole moment; 4) highest occupied molecular orbital (HOMO); 5) lowest unoccupied molecular orbital (LUMO); 6) energy gap between the HOMO and LUMO states (HOMO–LUMO); 7) molecular weight (MW); 8) rotatable bond number (RBN); 9) number of rings (nCIC); 10) number of hydrogen bond donors (nHDon); 11) number of hydrogen bond acceptors (nHAcc); 12) Ghose–Crippen octanol–water partition coefficient (ALogP); and 13) topological polar surface area (TPSA).

Univariate analysis

Univariate statistical approaches were employed to perform exploratory data analysis. Specifically, six descriptive statistical parameters were used to summarize the aforementioned set of 13 descriptors. These parameters consisted of the minimum (Min), first quartile (Q1), median, mean, third quartile (Q3), and maximum (Max) of the dataset. Box plots were applied to visualize the relative distribution of the values for each investigated variable; this involved the analysis of a set of 13 descriptors to identify the descriptors that exert great influence on the active and inactive classes of DPP4 inhibitors. Histograms were used to visualize and estimate the distribution of active and inactive classes of DPP4 inhibitors. Furthermore, the P-value was used to assess whether active and inactive classes of DPP4 inhibitors were significantly different using Student’s t-test.27
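The six descriptive parameters named above can be sketched in Python as follows. This is an illustration only: the input values are hypothetical, and the quartile convention (linear interpolation that includes the minimum and maximum) is an assumption, since the text does not state which quantile definition was used.

```python
import statistics


def six_number_summary(values):
    """Return Min, Q1, Median, Mean, Q3, and Max for one descriptor."""
    # method="inclusive" treats the data as the full population range,
    # so the quartiles are interpolated between observed values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values),
        "Q1": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "Q3": q3,
        "Max": max(values),
    }


# Hypothetical rotatable bond counts for a handful of compounds.
rbn_values = [3, 4, 5, 5, 6, 7, 9]
for name, value in six_number_summary(rbn_values).items():
    print(f"{name}: {value:.3f}")
```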

Principal component analysis

Principal component analysis (PCA) is an unsupervised learning approach that exposes related groups in the data without using a priori class labels. In practical terms, PCA reduces the dimensionality of the dataset while preserving most of the information in the original dataset.28 It does so by identifying directions, the so-called principal components (PCs), along which variation in the data is maximal. PCs are obtained by calculating the eigenvectors and eigenvalues of the data covariance (or correlation) matrix. The eigenvector associated with the largest eigenvalue gives the direction of the first PC (PC1), the eigenvector associated with the second largest eigenvalue gives the direction of the second PC (PC2), and so forth. PCA thus represents a dataset by a small number of PCs, in contrast to the initially large number of variables present in the original dataset.29 In this study, PCA was performed on the set of 13 molecular descriptors described in the previous section. Prior to PCA, all data were standardized to a comparable scale by transforming variables to zero mean and unit variance. PCA was computed separately for the active and inactive classes of DPP4 inhibitors using the FactoMineR30 package of the R statistical language.
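The steps described above (standardization, correlation matrix, leading eigenvector) can be sketched in plain Python. This is not the FactoMineR implementation used in the study; it swaps in simple power iteration to recover only the first PC, and the three descriptor columns are hypothetical.

```python
import math
import statistics


def standardize(columns):
    """Scale each descriptor column to zero mean and unit (sample) variance."""
    scaled = []
    for col in columns:
        mu = statistics.fmean(col)
        sd = statistics.stdev(col)
        scaled.append([(x - mu) / sd for x in col])
    return scaled


def correlation_matrix(columns):
    """Correlation matrix, i.e. the covariance matrix of the standardized data."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def leading_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector (PC1)."""
    p = len(matrix)
    # Arbitrary non-uniform start; a start orthogonal to PC1 would fail.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(vi * mvi for vi, mvi in zip(v, mv))
    return eigenvalue, v


# Hypothetical descriptor columns (one list per descriptor).
descriptors = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 5.0, 4.0, 5.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
corr = correlation_matrix(descriptors)
eigenvalue, pc1 = leading_component(corr)
# For a correlation matrix the eigenvalues sum to the number of descriptors.
print(f"PC1 explains {100 * eigenvalue / len(descriptors):.1f}% of the variance")
```

Because the data are standardized, each descriptor contributes one unit of variance, so the share of variance retained by a PC is its eigenvalue divided by the number of descriptors.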

DT analysis

A DT is composed of a hierarchical arrangement of nodes and branches in which the nodes represent the molecular descriptors, whereas the branches refer to decision rules to categorize compounds as actives and inactives. DT has been successfully applied in the analysis of various types of compounds, such as aromatase inhibitors,26 volatile organic compounds,31 and cytochrome P450-interacting compounds.32 A DT was constructed with WEKA, version 3.6,33 using the J48 algorithm (a Java implementation of the C4.5 algorithm). C4.5 establishes a DT by iteratively appending features having high information gains.34 Finally, C4.5 automatically calculates the feature usage obtained from the full DT or collection of rules. Molecular descriptors having the highest feature usage are considered to be the most important features.
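To make the notion of information gain concrete, the sketch below scores a single threshold split on one descriptor. The data and function names are hypothetical, and this is not WEKA code; note also that C4.5 normalizes this quantity by the split information (the gain ratio), whereas the sketch shows only the plain information gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


# Hypothetical molecular weights and activity classes.
mw = [250.0, 300.0, 380.0, 420.0]
activity = ["inactive", "inactive", "active", "active"]
print(information_gain(mw, activity, threshold=340.0))
```

A split that separates the two classes perfectly recovers the full entropy of the parent node, whereas a split that leaves one branch empty yields no gain.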

Chemical substructure analysis

In preparation for substructure analysis, the chemical structures of all DPP4 inhibitors were generated in structure-data file (SDF) format using MarvinSketch, followed by appending the bioactivity label to the SDF files using an in-house text processing tool coded in C++. Substructure analysis was performed using the Fragmenter and FragmentStatistics components of JChem version 14.8.18.0.35 Fragmenter processed the activity-tagged SDF file by generating molecular fragments according to the FragmenterAll protocol. Produced fragments were analyzed using the FragmentStatistics toolkit, whereby fragments were categorized as actives and inactives using pIC50 cutoff values of 6 and 5, respectively. Subsequently, fragments were assigned molecular scores according to the following equation: where N denotes the atom count of a given fragment of interest, whereas Nactive and Ninactive represent the number of occurrences of the fragment in the active and inactive classes, respectively.
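The pIC50 cutoffs of 6 and 5 used for the fragments correspond to the IC50 cutoffs of 1 μM and 10 μM used to define actives and inactives in the dataset. A short Python sketch of this conversion follows; the function names are hypothetical.

```python
import math


def pic50_from_ic50_um(ic50_um: float) -> float:
    """pIC50 = -log10(IC50 in mol/L), written for an IC50 given in micromolar."""
    # 1 uM = 1e-6 mol/L, so -log10(x * 1e-6) = 6 - log10(x).
    return 6.0 - math.log10(ic50_um)


def activity_class(ic50_um: float) -> str:
    """Classify a compound with the IC50 cutoffs used for the dataset."""
    if ic50_um <= 1.0:
        return "active"
    if ic50_um >= 10.0:
        return "inactive"
    return "intermediate"


for ic50 in (0.05, 1.0, 5.0, 10.0):
    print(ic50, round(pic50_from_ic50_um(ic50), 2), activity_class(ic50))
```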

Molecular docking and binding mode analysis

Molecular docking was performed to gain insights into how the inhibitors bind DPP4. Geometrically optimized structures of each compound were docked into the crystal structure of the DPP4 catalytic domain (PDB code 3C45, resolution of 2.05 Å) using AutoDock version 4.2.6,36 in which the rotatable bonds of the compounds were treated as flexible whereas the protein was kept rigid. A united-atom model was applied to both protein and ligand structures. Grid boxes were created to cover the inhibitor-binding site of the protein with a grid spacing of 0.375 Å, and the site of the co-crystallized ligand was set as the center of the box. The Lamarckian genetic algorithm was used as the search method with 50 runs, a population size of 150, and the maximum number of energy evaluations set to the high level. For the docking pose with the lowest binding energy, the anchor-binding mode in the DPP4 active site was subsequently analyzed using the SiMMap server.37 Three-dimensional models of the binding mode were visualized with PyMOL version 1.3.38

Results and discussion

Univariate analysis of active and inactive DPP4 inhibitors

The numbers of active and inactive DPP4 inhibitors compiled in this study were 2,075 and 534, respectively. Table 1 displays the six descriptive statistical parameters, which summarize the data as follows: 1) the median and mean provide a measure of the centrality of the data; 2) the Min and Max indicate the data range; and 3) Q1 and Q3 provide the lower and upper boundaries, respectively, of the central half of the data. Furthermore, the histograms shown in Figure 2 afford a graphical display of the data as frequency bars derived by binning continuous values into several data ranges. Figure 2A shows the distribution of active and inactive DPP4 inhibitors as red and blue bars, respectively, whereas the overlapping region is shown in purple. Figure 2B, which will be discussed in further detail in the “Analysis of active DPP4 inhibitors” section, displays the distribution of two subsets of active DPP4 inhibitors that will be referred to as active I and active II.
Table 1

Exploratory data analysis of actives and inactives using the six-term descriptive statistics

Statistics  MW       RBN     nCIC   nHDon   nHAcc   ALogP   TPSA     Qm     Energy  Dipole moment  HOMO    LUMO    HOMO–LUMO
Actives
 Min        167.2    0.000   0.000  0.000   1.000   −2.936  29.260   0.137  −0.908  0.747          −0.572  −0.298  0.217
 Q1         340.9    4.000   3.000  2.000   5.000   0.586   72.800   0.202  −0.260  4.075          −0.354  −0.039  0.301
 Median     386.5    5.000   3.000  3.000   7.000   1.571   85.250   0.217  −0.123  5.831          −0.343  −0.025  0.314
 Mean       385.8    5.008   3.155  2.735   6.897   1.566   89.490   0.222  −0.144  9.842          −0.352  −0.035  0.318
 Q3         430.5    6.000   4.000  3.000   8.000   2.586   103.660  0.236  −0.017  8.111          −0.331  −0.008  0.332
 Max        753.8    16.000  6.000  9.000   16.000  6.598   234.780  0.535  0.488   284.562        −0.286  0.047   0.386
Inactives
 Min        128.2    1.000   0.000  0.000   1.000   −2.485  3.240    0.142  −1.281  0.629          −0.490  −0.154  0.242
 Q1         238.4    3.000   2.000  1.000   4.000   0.848   47.720   0.193  −0.172  2.890          −0.344  −0.022  0.310
 Median     303.9    4.000   2.000  2.000   5.000   1.806   72.350   0.209  −0.097  3.961          −0.338  −0.005  0.331
 Mean       315.1    4.605   2.609  2.375   4.991   1.859   72.002   0.213  −0.119  4.443          −0.337  −0.007  0.331
 Q3         359.5    6.000   3.000  3.000   6.000   2.949   88.840   0.231  −0.048  5.262          −0.329  0.011   0.347
 Max        1,174.6  36.000  6.000  11.000  25.000  7.528   351.810  0.346  0.139   42.433         −0.290  0.100   0.414

Abbreviations: ALogP, Ghose-Crippen octanol-water partition coefficient; HOMO, highest occupied molecular orbital; HOMO-LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; Max, maximum; Min, minimum; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; Q1, first quartile; Q3, third quartile; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Figure 2

Histograms of the molecular descriptors for actives/inactives (A) and active I/active II DPP4 inhibitors (B).

Notes: Actives/active I and inactives/active II are shown in red and blue, respectively; purple regions represent their overlap.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

MW is a general measure of molecular size, and actives were found to be larger than inactives (P<0.001), with Q1 =340.9, median =386.5, mean =385.8, and Q3 =430.5 for actives, and Q1 =238.4, median =303.9, mean =315.1, and Q3 =359.5 for inactives (Table 1). As shown in Figure 2A, the MW distribution of actives was approximately normal, whereas that of inactives was positively skewed. RBN is the number of rotatable bonds in a molecule and provides a relative measure of molecular flexibility. A rotatable bond is defined as any single bond, not in a ring, bound to a nonterminal heavy atom. Amide C–N bonds are excluded from the count because of their high rotational energy barrier. Actives were found to have higher RBN values than their inactive counterparts, implying the importance of molecular flexibility for DPP4 inhibitory activity. Corresponding values of Q1 =4.0, median =5.0, mean =5.0, and Q3 =6.0 were obtained for actives, whereas values of Q1 =3.0, median =4.0, mean =4.6, and Q3 =6.0 were obtained for inactives. Although the distributions of actives and inactives are both positively skewed, the RBN values for actives are greater than those for inactives. Taken together, these results indicated that RBN differed only slightly between active and inactive DPP4 inhibitors, with P=0.001. The nCIC is calculated as the cardinality of the set of independent rings known as the smallest set of smallest rings. The nCIC of actives was higher than that of inactives (P<0.001), with values of Q1 =3.000, median =3.000, mean =3.155, and Q3 =4.000 for actives, and Q1 =2.000, median =2.000, mean =2.609, and Q3 =3.000 for inactives. nHDon is the number of hydrogen bond donors present in a molecule. The mean nHDon of actives (2.735±1.202) was higher than that of inactives (2.375±1.355).
The six-number descriptive summary confirmed that nHDon differed between active and inactive DPP4 inhibitors, with ranges of [0.000, 9.000] and [0.000, 11.000], respectively; for active DPP4 inhibitors, median =3.000, mean =2.735, and Q3 =3.000, and for inactive DPP4 inhibitors, median =2.000, mean =2.375, and Q3 =3.000. Furthermore, the histogram for active DPP4 inhibitors differs from that of inactive DPP4 inhibitors. Taken together, these results indicated that the nHDon of active and inactive DPP4 inhibitors was significantly different, with P<0.001. nHAcc represents the number of hydrogen bond acceptors present in a molecule. The mean nHAcc values of DPP4 inhibitors are 6.897±2.122 (active) and 4.991±2.217 (inactive), whereas the values of the descriptive statistics are Min =1.000, Q1 =5.000, median =7.000, mean =6.897, Q3 =8.000, and Max =16.000 for active DPP4 inhibitors, and Min =1.000, Q1 =4.000, median =5.000, mean =4.991, Q3 =6.000, and Max =25.000 for inactive DPP4 inhibitors. The histograms of these two inhibitor classes were found to differ from each other. These results indicated that the nHAcc of active and inactive DPP4 inhibitors was significantly different, with P<0.001. ALogP is a computational estimate of the logarithm of the 1-octanol/water partition coefficient, and it is a well-known measure of molecular hydrophobicity. The mean values of ALogP are 1.566±1.488 and 1.859±1.691 for active and inactive DPP4 inhibitors, respectively, and the descriptive statistics confirm this difference, with values of Min =−2.936, Q1 =0.586, median =1.571, mean =1.566, Q3 =2.586, and Max =6.598 for active DPP4 inhibitors, and Min =−2.485, Q1 =0.848, median =1.806, mean =1.859, Q3 =2.949, and Max =7.528 for inactive DPP4 inhibitors. Additionally, the ALogP distributions of active and inactive DPP4 inhibitors were significantly different, with P<0.001.
TPSA is an empirical measure of the polar surface area of a molecule, and it describes the contribution of polar atoms to the molecular surface. TPSA is frequently used in the study of drug transport properties such as intestinal absorption10 and blood–brain barrier permeability.11 High TPSA, in addition to indicating that the molecule possesses a complex surface charge environment, also indicates that the molecule inherently possesses poor membrane permeability and would need to rely on active transport, such as membrane-bound receptors. The mean TPSA of active DPP4 inhibitors (89.490±26.130) is greater than that of inactive DPP4 inhibitors (72.002±32.154); moreover, the six-number descriptive summary confirms that active and inactive DPP4 inhibitors differ, with Min =29.260, Q1 =72.800, median =85.250, mean =89.490, Q3 =103.660, and Max =234.780 for active DPP4 inhibitors, and Min =3.240, Q1 =47.720, median =72.350, mean =72.002, Q3 =88.840, and Max =351.810 for inactive DPP4 inhibitors. These results indicated that the overall TPSA patterns of active and inactive DPP4 inhibitors, including the histogram shapes in Figure 2A, were significantly different, with P<0.001. Qm is a global measure of the molecular charge. The mean Qm values of active and inactive DPP4 inhibitors are 0.222±0.034 and 0.213±0.030, respectively. The distributions of these two inhibitor classes were significantly different, with P<0.001. The six-number descriptive summary confirms this finding, with ranges of [0.137, 0.535] for active DPP4 inhibitors and [0.142, 0.346] for inactive DPP4 inhibitors, whereas the interquartile ranges (Q1 to Q3) are [0.202, 0.236] for active DPP4 inhibitors and [0.193, 0.231] for inactive DPP4 inhibitors. Energy is the sum of the atomic energy. The mean energy values of active and inactive DPP4 inhibitors are −0.144±0.183 and −0.119±0.129, respectively. The distributions of these two inhibitor classes are significantly different, with P<0.001.
Furthermore, the six-number descriptive summary for energy indicates that active DPP4 inhibitors differ from inactive DPP4 inhibitors, ie, Min =−0.908, Q1 =−0.260, median =−0.123, mean =−0.144, Q3 =−0.017, and Max =0.488 for active DPP4 inhibitors, whereas Min =−1.281, Q1 =−0.172, median =−0.097, mean =−0.119, Q3 =−0.048, and Max =0.139 for inactive DPP4 inhibitors. The dipole moment is a measure of the asymmetric distribution of charge in a molecule, where a low value suggests minimal charge separation and vice versa. Table 1 indicates that the mean dipole moment of active DPP4 inhibitors (9.842±15.038) is greater than that of inactive DPP4 inhibitors (4.443±3.303). The different patterns of these two DPP4 inhibitor classes are also indicated by the six-number descriptive summary, which consisted of Min =0.747, Q1 =4.075, median =5.831, mean =9.842, Q3 =8.111, and Max =284.562 for active DPP4 inhibitors, and Min =0.629, Q1 =2.890, median =3.961, mean =4.443, Q3 =5.262, and Max =42.433 for inactive DPP4 inhibitors. The ranges of active and inactive DPP4 inhibitors were dramatically different, with values of [0.747, 284.562] and [0.629, 42.433], respectively, as shown in the corresponding histograms. These results indicated that the dipole moments of active and inactive DPP4 inhibitors were significantly different, with P<0.001. The HOMO and LUMO are the highest-energy occupied and the lowest-energy unoccupied molecular orbitals, respectively. The mean values of HOMO and LUMO in active/inactive DPP4 inhibitors are −0.352±0.038/−0.337±0.019 and −0.035±0.049/−0.007±0.031, respectively. The values of HOMO range over [−0.572, −0.286] for active DPP4 inhibitors and [−0.490, −0.290] for inactive DPP4 inhibitors, whereas the values of LUMO range over [−0.298, 0.047] for active DPP4 inhibitors and [−0.154, 0.100] for inactive DPP4 inhibitors.
The interquartile ranges (Q1 to Q3) for HOMO are [−0.354, −0.331] for active DPP4 inhibitors and [−0.344, −0.329] for inactive DPP4 inhibitors, whereas those for LUMO are [−0.039, −0.008] for active DPP4 inhibitors and [−0.022, 0.011] for inactive DPP4 inhibitors. The HOMO and LUMO distributions of active and inactive DPP4 inhibitors are significantly different, with P<0.001. HOMO–LUMO is the energetic difference between the HOMO and LUMO states. HOMO–LUMO is a measure of kinetic stability and chemical reactivity, as HOMO and LUMO descriptors play fundamental roles in electron donation and acceptance. A large gap suggests high kinetic stability and low chemical reactivity because it is energetically unfavorable to add electrons to a high-lying LUMO or to extract electrons from a low-lying HOMO to form the activated complex of a potential reaction. Conversely, a molecule with a small or no HOMO–LUMO gap is chemically reactive. The mean values of HOMO–LUMO are 0.318±0.026 and 0.331±0.029 for active and inactive DPP4 inhibitors, respectively. The distributions of active and inactive DPP4 inhibitors are quite different. Additionally, the six-number descriptive summary confirms this finding, with ranges of [0.217, 0.386] for active DPP4 inhibitors and [0.242, 0.414] for inactive DPP4 inhibitors, whereas the lower and upper quartiles are [0.301, 0.332] for active DPP4 inhibitors and [0.310, 0.347] for inactive DPP4 inhibitors. These results indicate that the HOMO–LUMO values of active and inactive DPP4 inhibitors were significantly different, with P<0.001. Overall, 12 of the 13 descriptors differed between the two inhibitor classes at the level of P<0.001; the exception was RBN (P=0.001). These 12 descriptors are therefore the most useful for discriminating active from inactive DPP4 inhibitors.
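As a rough plausibility check on the reported significance levels, a two-sample t statistic can be recomputed from the summary values quoted above. The sketch below makes several assumptions that the text does not confirm: the ± values are standard deviations, the class sizes are 2,075 actives and 534 inactives, Welch's unequal-variance form is used in place of the Student's t-test named in the methods, and the P-value uses a normal approximation, which is reasonable only because the samples are large.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


def two_sided_p_normal(t):
    """Two-sided P-value under a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# nHDon: 2.735 +/- 1.202 (actives) versus 2.375 +/- 1.355 (inactives).
t_nhdon = welch_t(2.735, 1.202, 2075, 2.375, 1.355, 534)
p_nhdon = two_sided_p_normal(t_nhdon)
print(f"t = {t_nhdon:.2f}, approximate P = {p_nhdon:.1e}")
```

Under these assumptions the nHDon difference sits well below the P<0.001 threshold, in line with the value reported in the text.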

PCA analysis of active and inactive DPP4 inhibitors

In this study, the 13 descriptors were analyzed using the first three PCs because these PCs cumulatively account for up to about 70% of the original variance, as shown in Figure S1. Scores and loadings plots are presented in Figure 3A for actives (top row) and inactives (bottom row). Tables S1 and S2 show the loadings and contribution values, respectively, of each descriptor to each component. The contribution of a descriptor to a component is obtained as the ratio of its squared loading on that component to the eigenvalue associated with that component, expressed as a percentage.12
Figure 3

PCA scores plots of actives/inactives (A) and active I/active II (B) DPP4 inhibitors.

Note: The scores and loadings plots are shown in the left and right panels, respectively, where actives/active I and inactives/active II DPP4 inhibitors are shown in the top and bottom rows, respectively.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; PCA, principal component analysis; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

PC1 retained 27.93% and 35.06% of the original variance for active and inactive DPP4 inhibitors, respectively. Figure S1 indicates that the percentage of variance of inactive DPP4 inhibitors was higher than that of active DPP4 inhibitors. In Table S1 and Figure 3A (top-right), PC1 separates HOMO–LUMO from MW, RBN, nHDon, nHAcc, and TPSA for active DPP4 inhibitors, whereas in Figure 3A (bottom-right), PC1 separates energy from MW, RBN, nHAcc, TPSA, and Qm for inactive DPP4 inhibitors. In the loadings analysis, PC1 highly correlated with MW (0.849), RBN (0.638), nHDon (0.578), nHAcc (0.637), TPSA (0.693), and HOMO–LUMO (−0.575) for active DPP4 inhibitors, whereas in inactive DPP4 inhibitors, PC1 highly correlated with MW (0.849), RBN (0.664), nHAcc (0.893), TPSA (0.815), Qm (0.601), and energy (−0.654). These results indicated that PC1 correlated most strongly with MW and nHAcc for active and inactive DPP4 inhibitors, respectively. Furthermore, Table S2 also indicates that the MW descriptor highly contributes to PC1 for active DPP4 inhibitors, whereas the nHAcc descriptor highly contributes to PC1 for inactive DPP4 inhibitors. The descriptors nHDon, Qm, energy, and HOMO–LUMO influenced PC1 for either active or inactive DPP4 inhibitors, but not both. Interestingly, these four differential descriptors are reported with P<0.001 and are significantly different between the active and inactive inhibitor classes. It may be assumed that these four differential descriptors represent the informative features that discriminate between active and inactive DPP4 inhibitors. PC2, which is the direction uncorrelated with PC1, retained 23.36% and 20.82% of the original variance for active and inactive DPP4 inhibitors, respectively. Figure S1 indicates that the first two components preserve 51.29% and 55.88% of the original variance of active and inactive DPP4 inhibitors, respectively.
The results indicated that the cumulative percentage of variance for inactive DPP4 inhibitors was greater than that for active DPP4 inhibitors. In Table S1 and Figure 3A (top-right), PC2 separates dipole moment and energy from HOMO and LUMO for active DPP4 inhibitors, whereas in Figure 3A (bottom-right), PC2 of inactive DPP4 inhibitors separates ALogP and nCIC from nHDon and HOMO–LUMO. The PCA loadings indicate that PC2 highly correlated with energy (−0.765), dipole moment (−0.617), HOMO (0.637), and LUMO (0.732) for active DPP4 inhibitors, whereas for inactive DPP4 inhibitors, PC2 highly correlated with nCIC (−0.651), nHDon (0.605), ALogP (−0.813), and HOMO–LUMO (0.608). Furthermore, Table S2 indicates that the LUMO (17.664) and ALogP (24.441) descriptors highly contribute to PC2 for active and inactive DPP4 inhibitors, respectively. Table S1 indicates that the descriptors nCIC, nHDon, ALogP, energy, dipole moment, HOMO, LUMO, and HOMO–LUMO influence PC2 in either active or inactive DPP4 inhibitors. These eight descriptors are reported with P<0.001 and are significantly different between active and inactive DPP4 inhibitors; they may therefore represent the informative features that discriminate between the two classes. PC3, which is the direction that is orthogonal to both PC1 and PC2, accounted for 13.97% and 14.70% of the total variance for actives and inactives, respectively. Figure S1 indicates that the first three components preserve 65.26% (active) and 70.58% (inactive) of the original variance. The results indicated that the percentage and cumulative percentage of variance of inactive DPP4 inhibitors remained larger than those of active DPP4 inhibitors. This result is consistent with the observation that the distribution of active DPP4 inhibitors can be further divided into two groups, as represented by the scores plots in Figure 3A.
In Table S1 and Figure 3A (top-right), it can be seen that PC3 separates Qm from nCIC and ALogP for active DPP4 inhibitors, whereas in Figure 3A (bottom-right), PC3 separates dipole moment from HOMO and LUMO for inactive DPP4 inhibitors. Table S1 indicates that PC3 correlated highly with nCIC (0.739), ALogP (0.564), and Qm (−0.542) for active DPP4 inhibitors, whereas PC3 correlated highly with dipole moment (−0.634), HOMO (0.698), and LUMO (0.671) for inactive DPP4 inhibitors. nCIC and HOMO were the descriptors with the highest correlation with PC3 for active and inactive DPP4 inhibitors, respectively. Furthermore, Table S2 indicates that the nCIC (30.046) and HOMO (25.521) descriptors contribute most to PC3 for active and inactive DPP4 inhibitors, respectively. Table S1 indicates that six descriptors (Qm, nCIC, ALogP, dipole moment, HOMO, and LUMO) influenced PC3 for only one of the two classes. Interestingly, these six differential descriptors are reported with P<0.001 and are significantly different between active and inactive DPP4 inhibitors. They may therefore represent informative features that discriminate between active and inactive DPP4 inhibitors.
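
The link between the variance percentages and the contribution values quoted above can be checked with a short calculation. The sketch below first recovers each component's own variance share from the cumulative percentages, then estimates a contribution as 100 × loading² / eigenvalue. The eigenvalue convention (explained-variance fraction multiplied by 13 standardized descriptors) is an assumption that is not stated in the text, and the function names are illustrative only.

```python
# Cumulative variance (%) of PC1..PC3 as quoted in the text (Figure S1).
CUMULATIVE = {
    "active": [27.93, 51.29, 65.26],
    "inactive": [35.06, 55.88, 70.58],
}
N_DESCRIPTORS = 13  # assumption: PCA run on 13 standardized descriptors


def per_component(cumulative):
    """Recover each PC's own variance share from cumulative percentages."""
    shares = [cumulative[0]]
    for previous, current in zip(cumulative, cumulative[1:]):
        shares.append(round(current - previous, 2))
    return shares


def contribution(loading, variance_pct, n_descriptors=N_DESCRIPTORS):
    """Estimate a descriptor's contribution (%) to one PC.

    Assumes eigenvalue = variance fraction * number of descriptors, so that
    contribution = 100 * loading**2 / eigenvalue.
    """
    eigenvalue = variance_pct / 100.0 * n_descriptors
    return 100.0 * loading ** 2 / eigenvalue


active = per_component(CUMULATIVE["active"])
inactive = per_component(CUMULATIVE["inactive"])
shares = {"active": active, "inactive": inactive}

# (class, descriptor, loading, zero-based PC index, contribution quoted in the text)
reported = [
    ("active", "LUMO", 0.732, 1, 17.664),
    ("inactive", "ALogP", -0.813, 1, 24.441),
    ("active", "nCIC", 0.739, 2, 30.046),
    ("inactive", "HOMO", 0.698, 2, 25.521),
]
for cls, name, loading, pc, quoted in reported:
    estimate = contribution(loading, shares[cls][pc])
    print(f"{cls:8s} {name:6s} PC{pc + 1}: estimated {estimate:.2f}, quoted {quoted}")
```

Under this assumption the estimates fall within about 0.03 percentage points of the quoted contribution values, which is compatible with rounding of the loadings to three decimals.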

Analysis of active DPP4 inhibitors

Figure 3A indicates that the data points of the scores plot (top-left) of active DPP4 inhibitors can be well discriminated into two subclasses (called active I and active II DPP4 inhibitors). We therefore assumed that the inhibitors in this class may be further separated into subclasses, and in this section the active DPP4 inhibitors were analyzed according to these subclasses. Table 2 indicates that, of the 13 descriptors, eight exhibit different patterns between active I and active II DPP4 inhibitors at the level of P<0.001; the five exceptions are RBN (P=0.593), nCIC (P=0.001), ALogP (P=0.208), TPSA (P=0.026), and Qm (P=0.001). These five descriptors have average values of 5.000±2.240 (RBN), 3.103±0.895 (nCIC), 1.585±1.461 (ALogP), 89.981±26.752 (TPSA), and 0.223±0.034 (Qm) for active I DPP4 inhibitors, whereas active II DPP4 inhibitors have average values of 5.052±1.468 (RBN), 3.263±0.784 (nCIC), 1.463±1.628 (ALogP), 86.867±22.370 (TPSA), and 0.217±0.032 (Qm). In Figure 2B, the histograms of active I and active II DPP4 inhibitors indicated that these five descriptors were not different between the two subclasses. Therefore, except for these five descriptors, the remaining descriptors are significantly different for active I and active II DPP4 inhibitors and are efficient for discrimination.
Table 2

Exploratory data analysis of subclasses of actives (I and II) using the six-term descriptive statistics

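Statistics | MW | RBN | nCIC | nHDon | nHAcc | ALogP | TPSA | Qm | Energy | Dipole moment | HOMO | LUMO | HOMO–LUMO
Actives I
Min | 167.2 | 0.000 | 0.000 | 0.000 | 1.000 | −2.936 | 29.260 | 0.137 | −0.908 | 0.747 | −0.572 | −0.298 | 0.217
Q1 | 333.5 | 3.000 | 3.000 | 2.000 | 5.000 | 0.637 | 73.250 | 0.203 | −0.264 | 3.998 | −0.353 | −0.036 | 0.303
Median | 381.0 | 5.000 | 3.000 | 3.000 | 7.000 | 1.587 | 85.250 | 0.219 | −0.133 | 5.675 | −0.343 | −0.023 | 0.317
Mean | 381.9 | 5.000 | 3.103 | 2.636 | 6.961 | 1.585 | 89.980 | 0.224 | −0.154 | 7.976 | −0.350 | −0.029 | 0.320
Q3 | 429.4 | 6.000 | 4.000 | 3.000 | 8.000 | 2.577 | 104.670 | 0.237 | −0.041 | 7.635 | −0.331 | −0.006 | 0.334
Max | 753.8 | 16.000 | 6.000 | 9.000 | 16.000 | 6.598 | 234.780 | 0.535 | 0.488 | 80.233 | −0.286 | 0.047 | 0.386
Actives II
Min | 202.4 | 2.000 | 1.000 | 1.000 | 3.000 | −2.163 | 47.950 | 0.162 | −0.695 | 1.035 | −0.492 | −0.155 | 0.243
Q1 | 375.5 | 4.000 | 3.000 | 3.000 | 5.000 | 0.350 | 69.460 | 0.200 | −0.216 | 4.861 | −0.411 | −0.127 | 0.293
Median | 403.4 | 5.000 | 3.000 | 3.000 | 6.000 | 1.504 | 84.660 | 0.210 | −0.034 | 7.328 | −0.349 | −0.035 | 0.305
Mean | 407.0 | 5.052 | 3.431 | 3.263 | 6.557 | 1.463 | 86.870 | 0.217 | −0.093 | 19.814 | −0.367 | −0.062 | 0.304
Q3 | 435.4 | 6.000 | 4.000 | 4.000 | 8.000 | 2.641 | 101.040 | 0.224 | 0.068 | 33.395 | −0.332 | −0.023 | 0.316
Max | 658.7 | 16.000 | 6.000 | 7.000 | 14.000 | 4.985 | 188.480 | 0.390 | 0.271 | 284.562 | −0.307 | 0.037 | 0.386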

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; Max, maximum; Min, minimum; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; Q1, first quartile; Q3, third quartile; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Figure 3B shows the scores and loadings plots for active I (top-left) and active II (bottom-left) DPP4 inhibitors. It is observed that the distribution of active I and active II DPP4 inhibitors cannot be further divided. The first three PCs of active I and active II DPP4 inhibitors cumulatively retained 66.63% and 68.21% of the original variation, respectively, and the first five PCs retained 80.0%. To identify the descriptors with the greatest influence on each PC, the loadings and contribution values were used, as shown in Tables S3 and S4, respectively. PC1 correlated highly with MW (0.834), RBN (0.629), nHDon (0.587), nHAcc (0.675), TPSA (0.698), and HOMO–LUMO (−0.565) for active I DPP4 inhibitors, whereas PC1 correlated highly with energy (−0.815), dipole moment (−0.634), HOMO (0.699), LUMO (0.807), and HOMO–LUMO (0.560) for active II DPP4 inhibitors. For PC2, the descriptors energy (−0.743), dipole moment (−0.602), HOMO (0.634), and LUMO (0.702) correlated highly with this component for active I DPP4 inhibitors, whereas MW (0.750), ALogP (−0.551), and TPSA (0.797) correlated highly with this component for active II DPP4 inhibitors. The third PC correlated highly with nCIC (0.734), ALogP (0.619), and Qm (−0.540) for active I DPP4 inhibitors, whereas PC3 correlated highly with nCIC (0.698) for active II DPP4 inhibitors. The descriptors MW, energy, and nCIC provide the highest absolute loadings scores on PC1, PC2, and PC3, respectively, for active I DPP4 inhibitors, whereas the descriptors energy, TPSA, and nCIC provide the highest absolute loadings scores on PC1, PC2, and PC3, respectively, for active II DPP4 inhibitors. These results are consistent with the contribution values: MW (17.639), energy (19.829), and nCIC (27.810) provide the highest values on PC1, PC2, and PC3, respectively, for active I DPP4 inhibitors, whereas energy (16.193), TPSA (22.921), and nCIC (24.428) provide the highest contribution values on PC1, PC2, and PC3, respectively, for active II DPP4 inhibitors, as shown in Table S4.

Prediction and identification of informative molecular descriptors for DPP4 inhibitors

In this study, a QSAR model based on the J48 algorithm is presented for discriminating DPP4 inhibitors as either actives or inactives. Each compound was encoded as an M-dimensional vector of molecular descriptors, where M=13. The encoded compounds from the DPP4-TRN set were then used to construct a QSAR model, which was represented by a DT. To evaluate the internal prediction capacity of our proposed QSAR model on the DPP4-TRN set, two different experiments were performed: one on the full training data and one using a tenfold cross-validation (CV) procedure, as shown in Table 3. The CV procedure was performed by first partitioning the data into ten approximately equal-sized segments or folds; then, nine folds were used as the training data while the remaining fold was used for validation. This was repeated so that each fold served once for validation, and the results were averaged across the ten experiments. Four measurements were used to assess the performance of the QSAR models, namely accuracy (Acc), sensitivity (Sen), specificity (Spec), and the Matthews correlation coefficient (MCC). Our proposed QSAR model yielded 96.43% Acc, 98.30% Sen, 94.38% Spec, and 0.929 MCC on the full training data. The prediction results from the tenfold CV procedure were 82.26% Acc, 84.69% Sen, 79.59% Spec, and 0.644 MCC. This result indicated that the 13 molecular descriptors are effective in predicting DPP4 inhibitors, providing an Acc higher than 80.0% and an MCC of 0.644 under CV.
Table 3

Summary of prediction performance of internal and external sets

Dataset | Details | N | Acc (%) | Sen (%) | Spec (%) | MCC
Internal set (DPP4-TRN) | Full training | 1,122 | 96.43 | 98.30 | 94.38 | 0.929
Internal set (DPP4-TRN) | Ten-fold CV | 1,122 | 82.26 | 84.69 | 79.59 | 0.644
External set 1 (DPP4-TEST1) | External validation | 149 | 91.28 | | |
External set 2 (DPP4-TEST2) | External validation | 160 | 95.63 | | |
External set 3 (DPP4-TEST3) | External validation | 167 | 72.25 | | |

Note: N is the number of compounds.

Abbreviations: Acc, accuracy; CV, cross-validation; MCC, Matthews correlation coefficient; Sen, sensitivity; Spec, specificity.
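
The four measurements in Table 3 follow directly from the confusion matrix. As a minimal sketch, the function below computes Acc, Sen, Spec, and MCC from the numbers of true positives, true negatives, false positives, and false negatives. The counts used here are not reported in the text: they are back-calculated values (an assumed split of 588 actives and 534 inactives) chosen because they are consistent with N=1,122 and the reported percentages.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity (all in %) and MCC."""
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total
    sen = 100.0 * tp / (tp + fn)    # share of actives predicted as active
    spec = 100.0 * tn / (tn + fp)   # share of inactives predicted as inactive
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spec, mcc


# Back-calculated counts (assumption, not given in the source):
# 588 actives and 534 inactives, N = 1,122.
full_training = classification_metrics(tp=578, tn=504, fp=30, fn=10)
tenfold_cv = classification_metrics(tp=498, tn=425, fp=109, fn=90)

for label, (acc, sen, spec, mcc) in [("Full training", full_training),
                                     ("Ten-fold CV", tenfold_cv)]:
    print(f"{label}: Acc={acc:.2f}% Sen={sen:.2f}% Spec={spec:.2f}% MCC={mcc:.3f}")
```

With these assumed counts the function reproduces the internal-set rows of Table 3, which illustrates how the MCC summarizes all four cells of the confusion matrix rather than only the correctly predicted compounds.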

Identification of informative molecular descriptors provided a better understanding of the different characteristics between active and inactive DPP4 inhibitors. After construction of the DT, the informative molecular descriptors could be identified using the feature usage score. The molecular descriptor with the highest feature usage is the most important feature because it contributes the most to prediction performance. Figure 4 shows the feature usage of each descriptor (descriptor usage) obtained using the J48 algorithm on DPP4-TRN.34 The top five informative molecular descriptors having a descriptor usage score larger than 30 were MW, LUMO, nHDon, nHAcc, and ALogP. Interestingly, for the five top-ranked and informative molecular descriptors, the distributions of active and inactive DPP4 inhibitors were significantly different, with P<0.001, as shown in Table 1. Furthermore, the three external validation sets were used for evaluating the robustness and generalization ability of the proposed QSAR model established from the DPP4-TRN. Figure S2 shows an overview of the Tanimoto coefficients for the four datasets as a heatmap. For example, the top-right panel shows the heatmap of DPP4-TRN versus DPP4-TEST2. The QSAR model achieved test accuracies of 91.28%, 95.63%, and 72.25% on DPP4-TEST1, DPP4-TEST2, and DPP4-TEST3, respectively. Based on our results, it could be concluded that our proposed QSAR model was efficient in classifying DPP4 inhibitors as either actives or inactives and in filtering inactive DPP4 inhibitors from active DPP4 inhibitors.
Figure 4

Plot of the descriptor usage derived from the J48 algorithm.

Note: The descriptor with the largest descriptor usage value is the most important.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Chemical substructure analysis of active and inactive DPP4 inhibitors is an effective approach to identify important chemical fragments that may govern the biological activity toward the DPP4 enzyme. Tables 4 and 5 summarize the top ten fragments of the active and inactive inhibitor classes, respectively. The top ten fragments of active inhibitors indicated that pyrrolidine-based, thiazolidine-based, amino amide-based, pyridine-based, piperazine-based, and aromatic-based fragments are essential for DPP4 inhibition. The fragment 1-ethyl-2-fluorobenzene ranked first (617 counts), followed by 2-amino-1-(pyrrolidin-1-yl)ethan-1-one (597 counts). The occurrence of these top two fragments is clearly greater than that of the remaining fragments, as indicated by the fragment counts (Table 4), which indicate their important roles in DPP4 inhibition.
Table 4

Summary of the top ten fragments in the active set of DPP4 inhibitors

Rank | IUPAC name | Fragment count
1 | 1-ethyl-2-fluorobenzene | 617
2 | 2-amino-1-(pyrrolidin-1-yl)ethan-1-one | 597
3 | 1-(1,3-thiazolidin-3-yl)propan-1-one | 136
4 | 1-(pyrrolidin-1-yl)propan-1-one | 101
5 | Propylbenzene | 52
6 | 2-amino-N-methylpentanamide | 50
7 | 2,3,6-trimethylpyridine | 45
8 | (1-formylpyrrolidin-2-yl)boronic acid | 43
9 | 1-chloro-2-ethenylbenzene | 36
10 | 4-(1-ethylhydrazin-1-yl)-1-methylpiperazine | 32

Abbreviation: IUPAC, International Union of Pure and Applied Chemistry.

Table 5

Summary of the top ten fragments in the inactive set of DPP4 inhibitors

Rank | IUPAC name | Fragment count
1 | Benzyl(ethyl)amine | 102
2 | 2-methyl-2,3-dihydro-1H-isoindole | 77
3 | 1-(pyrrolidin-1-yl)propan-1-one | 64
4 | 1-(piperidin-1-yl)ethan-1-one | 45
5 | Propylbenzene | 35
6 | 3-ethyl-4-methylpyrrolidin-2-one | 19
7 | 2-amino-1-(pyrrolidin-1-yl)ethan-1-one | 17
8 | 3-ethyl-2,4-dimethylpyridine | 14
9 | N-ethylcyclohexanamine | 14
10 | (Pyrrolidin-2-yl)phosphonic acid | 13

Because DPP4 prefers substrates containing proline or alanine at position 2 of the N-terminus, many inhibitors have been designed based on the peptidomimetic concept.12 These peptidomimetic inhibitors are categorized as glycine-based and β-alanine-based types.12 Pyrrolidine has been used as a core structure in the design of both inhibitor types because its functional groups play crucial roles in interactions at the active site of the enzyme. The DPP4 inhibitory activities of these inhibitors are similarly accomplished by hydrophobic and van der Waals interactions,39 as well as hydrogen-bond and salt-bridge formation.12 The active site of DPP4 consists of a catalytic triad (Ser630, Asp708, and His740), an oxyanion hole, and specific binding pockets, ie, the S1 and S2 pockets.12,39 All known DPP4 inhibitors have been reported to occupy these pockets for inhibition.39 The most frequently found fragment is 1-ethyl-2-fluorobenzene, which is a highly lipophilic aromatic-based fragment. The aryl substitution on the C-4 position of the pyrrolidine ring has been noted to improve the stability and duration of DPP4 inhibitors.40 In addition, the fluorine substituent on the C-4 position of the pyrrolidine ring has been reported to provide good inhibitory properties, selectivity, and pharmacokinetic profiles.41 The preferable pharmacokinetic profile may result from a lipophilic property governed by a planar aromatic ring and halogen atoms, which facilitates cell entry to the target site of action. In this study, a similar aromatic-based fragment containing a halogen atom, ie, 1-chloro-2-ethenylbenzene, was found as the ninth-ranked fragment. Additional aromatic-based fragments were also ranked among the top ten fragments, such as propylbenzene and 2,3,6-trimethylpyridine. 
It could be hypothesized that the flexibility of the rotatable alkyl chain in the propylbenzene fragment may facilitate cell penetration and hydrophobic interactions at the active site, and the nitrogen atom in the pyridine ring of 2,3,6-trimethylpyridine may play a role in H-bond formation in the DPP4 active site. The pyrrolidine amide is considered a key moiety in the design of DPP4 inhibitors.12 Most of the potent inhibitors have been developed by substitution of the amide moiety of this core structure with an electrophile42–44 that forms a covalent adduct with Ser630 of the DPP4 active site.12 Therefore, it is not surprising that among the top ten fragments, the pyrrolidine-based fragments, ie, 2-amino-1-(pyrrolidin-1-yl)ethan-1-one, 1-(pyrrolidin-1-yl)propan-1-one, and (1-formylpyrrolidin-2-yl)boronic acid, appear to be the most frequently occurring fragments. Notably, the 2-amino-1-(pyrrolidin-1-yl)ethan-1-one fragment, which is presented in many compounds, has been used as a prototype for structural modification.12 All of these fragments are amide derivatives of pyrrolidine. It is possible that the oxygen atom of the amide functional group may be essential for H-bond formation with the DPP4 active site.12 In addition, the amine group has been noted for its role in forming a salt-bridge with Glu205 and/or Glu206 of DPP4.12 Moreover, the boronic acid derivative of pyrrolidine amide, (1-formylpyrrolidin-2-yl)boronic acid, ranked eighth. This finding supported the fact that substitution of boronic acid at the 2-position of the pyrrolidine ring is effective for DPP4 inhibition, as observed from the progress of talabostat into Phase III clinical trials.12,44 The thiazolidine derivative fragment, 1-(1,3-thiazolidin-3-yl)propan-1-one, ranked third. 
Clearly, the shape of this fragment is similar to that of pyrrolidine amide derivatives (ie, 2-amino-1-(pyrrolidin-1-yl)ethan-1-one and 1-(pyrrolidin-1-yl)propan-1-one) except for the presence of a sulfur atom in the five-membered ring. The thiazolidine analog of pyrrolidine-based compounds has been noted for its stability, potency, selectivity, and oral bioavailability.45–47 The amide-based fragment, ie, 2-amino-N-methylpentanamide, was found to be the sixth-ranked fragment. The X-ray crystal structure indicated that the amide moiety is essential for a key interaction in DPP4 inhibition.12 The amino group (−NH2) forms a salt-bridge with Glu205, and the O atom of the carbonyl group (−C=O) forms an H-bond with Arg125 in the DPP4 active site.12 In addition, the piperazine-based fragment, ie, 4-(1-ethylhydrazin-1-yl)-1-methylpiperazine, was found as the tenth-ranked fragment. DPP4 inhibitors containing a piperazine substituent have been reported to exhibit high potency.48 Notably, some fragments of active inhibitors, ie, 2-amino-1-(pyrrolidin-1-yl)ethan-1-one, 1-(pyrrolidin-1-yl)propan-1-one, and propylbenzene, were also found in the top ten fragments of the inactive inhibitor class. This finding may indicate that the inhibitory activities of DPP4 inhibitors are influenced by additional factors. The results of the inactive inhibitors (Table 5) indicated that the type and position of the substituents, type of functional groups, appropriate size, and arrangement of substructures may be crucial for DPP4 inhibition. For example, the effect of the position of substituents and the length of the alkyl chain were found when comparing 2,3,6-trimethylpyridine (active) and 3-ethyl-2,4-dimethylpyridine (inactive).

Scaffold analysis

Analysis of the molecular scaffolds of DPP4 inhibitors was performed in order to discern important core structures giving rise to their bioactivity. Datasets of both active and inactive DPP4 inhibitors were subjected to molecular scaffold analysis using the Bemis–Murcko framework clustering method as implemented by JKlustor version 0.07.49 In brief, this clustering method initially generates molecular frameworks representing molecular scaffolds as derived from compounds in datasets by removing side chain atoms from the main structures and finally presenting them in the form of a molecular graph, which is subsequently clustered based on the Bemis–Murcko framework algorithm.50 A total of 332 and 152 scaffolds were obtained for actives and inactives, respectively. A larger number of molecular scaffolds is indicative of a higher diversity of molecular patterns in a dataset. Herein, this result suggests that molecular patterns in active DPP4 inhibitors are more diverse than those in their inactive counterparts. Further in-depth analysis of scaffolds from both active and inactive classes was performed by comparing members of each molecular scaffold from both classes. It was found that there were no significant differences in the molecular frameworks between the two classes, as can be seen in Tables S5 and S6 and Figure 5. This suggested that the important structures responsible for the bioactivity were functional groups as well as substructures of molecules.
Figure 5

Summary of top 20 molecular frameworks for actives (1a–20a) and inactives (1b–20b).
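
The framework-generation step described in the scaffold analysis (removing side-chain atoms so that only ring systems and their linkers remain) can be illustrated on a plain molecular graph. The following is a minimal sketch, not the JKlustor implementation: it treats a molecule as an adjacency map of heavy atoms and repeatedly strips terminal atoms. The function name and the atom numbering are hypothetical, and atom and bond types are ignored.

```python
def murcko_framework(adjacency):
    """Return the atoms that remain after repeatedly removing terminal atoms.

    `adjacency` maps each heavy-atom index to the indices it is bonded to.
    Atoms with at most one neighbour are side-chain ends; stripping them until
    none remain leaves ring systems plus the linkers that connect them.
    """
    graph = {atom: set(neighbours) for atom, neighbours in adjacency.items()}
    terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:
            continue  # already removed earlier
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)
    return set(graph)


# Propylbenzene: ring atoms 0-5, propyl chain 6-7-8 attached to atom 0.
propylbenzene = {
    0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0],
    6: [0, 7], 7: [6, 8], 8: [7],
}
print(sorted(murcko_framework(propylbenzene)))  # ring atoms only
```

In this toy case the propyl chain is stripped and only the six ring atoms remain, so any compound that differs from propylbenzene solely in its side chains would map to the same framework. An acyclic molecule reduces to an empty set under this procedure.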

In order to elucidate such important substructures, Klekota–Roth fingerprints consisting of 4,860 descriptors were generated by the PaDEL-Descriptor software on DPP4-TRN.51,52 Consequently, a mean decrease of the Gini index (MDGI) as derived from random forest53,54 was used as the basis for selecting the most important feature from the initial set of 4,860 descriptors. The descriptor having the highest MDGI value was deemed to be the most important feature because it affords the most influence to the prediction performance. The set of 30 top-ranked fingerprints having the largest MDGI values are summarized in Figure S3 and Table S7. It can be seen that the most important structural fingerprint is piperazine-1-carbaldehyde (KRFP4541) with a MDGI value as high as 9.197. Meanwhile, the second most important structural fingerprint with a MDGI value of 3.610 is the piperazine ring (KRFP2428). Interestingly, the significance of piperazine is supported by the fact that it is an important structural part of oral antihyperglycemic agents called gliptins, which target DPP4 receptors and have been approved by the US Food and Drug Administration (FDA) for use in T2D treatment. Particularly, sitagliptin and teneligliptin, which are piperazine containing gliptins, have shown an additional mode of binding with the DPP4 receptor. In brief, the DPP4 inhibitors can be categorized into three classes according to their binding subsites.55 Class I DPP4 inhibitors (ie, vildagliptin and saxagliptin) employed cyanopyrrolidine and hydroxyl adamantyl moieties to bind to S1 and S2 subsites of the DPP4 active site, respectively. In addition to the binding mode of class I, class II DPP4 inhibitors (ie, two recently released DPP4 inhibitors alogliptin and linagliptin) can further engage in π–π interaction with S’1 and S’2 subsites. 
As for class III DPP4 inhibitors (ie, sitagliptin and teneligliptin), the presence of the piperazine ring at the P2 position engages in interaction with the S2 extensive subsite and introduces the “anchor lock domain” resulting in an increase of the binding activity owing to the stronger hydrophobic interactions mediated by this domain.56–59 In addition, results of contact area calculation of this domain also revealed correlation between the binding surface and the inhibitory activity against DPP4 receptor, further emphasizing the importance of this domain.55 Nevertheless, the role of piperazine derivatives in DPP4 inhibitory activity is not only found in these two drugs but is also reported in various DPP4 inhibitors that are under active development.60–62
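
The MDGI ranking used above rests on the Gini impurity decrease obtained when a node is split on one fingerprint bit. As a rough illustration, the sketch below computes that decrease for a single split on a binary fingerprint; a random forest accumulates such decreases over every node and tree in which the fingerprint is used, so the values here are not comparable to the MDGI values quoted in the text. The data are toy values and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary class labels (1 = active, 0 = inactive)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)  # equals 1 - p**2 - (1 - p)**2


def gini_decrease(bits, labels):
    """Impurity decrease from splitting the compounds on one fingerprint bit."""
    present = [y for b, y in zip(bits, labels) if b]
    absent = [y for b, y in zip(bits, labels) if not b]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


# Toy data: six compounds, three active (1) and three inactive (0).
labels = [1, 1, 1, 0, 0, 0]
informative_bit = [1, 1, 1, 0, 0, 0]    # present only in actives
uninformative_bit = [1, 0, 1, 0, 1, 0]  # spread across both classes

print(gini_decrease(informative_bit, labels))
print(gini_decrease(uninformative_bit, labels))
```

A fingerprint that separates actives from inactives cleanly yields the maximum decrease for a balanced node (0.5), whereas one distributed similarly across both classes yields a decrease close to zero.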

Binding mode of DPP4 inhibitors

Molecular docking and subsequent post-docking analyses using the SiMMap server identified the common binding mode of DPP4 inhibitors as well as key interactions with the enzyme. The SiMMap server provided a site-moiety map of the binding pocket along with details on conserved interacting residues, moiety preferences, and interaction types.37 Analyses based on 100 active DPP4 inhibitors revealed three different binding anchors (HB1, HB2, and vdW) and their moiety preferences (Figure 6). The anchor HB1 comprised side chains of Arg125, Glu205, Glu206, and Tyr662 while anchor HB2 contained only the hydroxyl side chain of Tyr547. Both anchors were found to make hydrogen bonds with several nitrogen functional groups (ie, amine-, amide-, imine-, and nitrile-based) as well as ketone-based moieties of the inhibitors. In contrast, the anchor vdW consisted primarily of hydrophobic side chains of Tyr547, Tyr631, Trp659, Tyr662, and Tyr666 as well as the hydroxyl group of the catalytic residue Ser630. This pocket formed van der Waals contacts with aromatic, heterocyclic, and aliphatic moieties of DPP4 inhibitors. It should be noted that from our SiMMap analyses, the anchor HB1 has been known as the S2 pocket, which is involved in key salt bridge interactions of either the free amino terminus of a peptide substrate or the cationic groups of an inhibitor with the carboxylate side chains of Glu205 (and/or Glu206) as well as the guanidinium side chain of Arg125, which also helps stabilize either the amide carbonyl group of a substrate or the ketone moiety of an inhibitor.7,12 The anchor vdW corresponds to the S1 selectivity pocket of the enzyme that has been shown to be occupied with specific benzene- and pyrrolidine-based moieties of the DPP4 inhibitors.7,12
Figure 6

Three different binding modes of interaction of DPP4 inhibitors in the active site of the enzyme.

Notes: The identified anchors HB1, HB2, and vdW from the SiMMap server are labeled and shown in cyan and yellow spheres, respectively. Docking poses of two selected inhibitors are visualized herein: the compound with the best SiMMap score (A) and the compound with the lowest half maximal inhibitory concentration values (B). Residues at the active site are shown in green sticks while key interacting residues are labeled and shown in dark grey lines.

It should be noted that at least the five DPP4 inhibitors with the best SiMMap scores contained amine, amide, and aromatic moieties that interact with all three binding anchors (HB1, HB2, and vdW) of the enzyme. These findings suggest that the moiety preferences of inhibitors are significant for binding and inhibiting DPP4, and they may serve as a general guideline for the design of novel DPP4 inhibitors.

Comparison with FDA-approved drugs

In order to investigate the similarity between compounds investigated herein with those of FDA-approved DPP4 inhibitors, Tanimoto coefficient was computed for each compound in the dataset as well as six FDA-approved DPP4 inhibitors (ie, sitagliptin, vildagliptin, saxagliptin, alogliptin, linagliptin, and teneligliptin). The Tanimoto coefficient is a well-known metric for assessing the pairwise similarity between two molecules in which higher score represents high similarity. Results revealed that four of six DPP4 inhibitors (ie, sitagliptin, vildagliptin, saxagliptin, and linagliptin) were included in our curated dataset as observed from a Tanimoto coefficient of 1.000. The closest analog in our dataset to alogliptin and teneligliptin had Tanimoto coefficients of 0.819 and 0.602, respectively. Manual inspection of the pairwise Tanimoto coefficients between each compound of the dataset and the six FDA-approved drugs revealed that there were indeed several analogs of FDA-approved drugs present in the dataset. Such presence of analogs of FDA-approved drugs may densely populate the dataset and possibly mask the effect of less densely populated compounds. Concomitant with this issue is the observed imbalance in size of actives and inactives. Particularly, the rather small size of inactives may arise from the possibility that poor results for DPP4 inhibitory assays may not be published as often and therefore may contribute to the lower number of inactives. As fuzzy C-means clustering was applied in sampling the dataset for QSAR modeling, such aforementioned chemical space bias would not exert its influence on the constructed QSAR models. A further look at the bioactivity of compounds exhibiting Tanimoto coefficient ≥0.5 to FDA-approved DPP4 inhibitors was performed. It was observed that the number of highly similar compounds with sitagliptin, vildagliptin, saxagliptin, alogliptin, linagliptin, and teneligliptin were 131, 273, 266, 87, 76, and 60 compounds, respectively. 
Of these compounds, a total of 130, 214, 192, 86, 76, and 60 compounds were classified as actives (IC50 less than 1 μM) for sitagliptin, vildagliptin, saxagliptin, alogliptin, linagliptin, and teneligliptin, respectively. Interestingly, a total of 59 and 74 compounds exhibiting high similarity with vildagliptin and saxagliptin, respectively, were classified as inactive. R-group analysis of pyrrolidine as the privileged structure of these molecules provided insight into the important substituents at positions 1, 2, and/or 5 of this ring. The alkyl group connected to the nitrogen atom at position 1 appeared to be important, since many structural modifications were observed at this position, followed by positions 2 and/or 5, where the active moiety is usually a nitrile. Herein, functional group and molecular fragment modifications based on commercially available DPP4 inhibitors could provide promising starting structures for further improving bioactivity. Nevertheless, the agreement of the binding mode of any modified structure with the DPP4 receptor should be considered at the same time in order to avoid steric effects that could lead to lowered bioactivity. Furthermore, Lipinski’s rule of five was applied to the compiled compounds from all datasets, and the results are summarized in Table S8. Interestingly, it can be seen that compounds belonging to the internal set (DPP4-TRN) along with the external set (DPP4-TEST3) afforded roughly similar percentages of compounds passing the rule of five at approximately 90%, while DPP4-TEST1 and DPP4-TEST2 afforded close to 70%. The former sets contained primarily proteins belonging to the DPP family while the latter sets represent random proteins and proteases. Furthermore, actives (~94%) from DPP4-TRN provided higher percentages than their inactive counterparts (~84%–89%).
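
The similarity screen described in this section can be sketched in a few lines. The example below computes the Tanimoto coefficient between fingerprints represented as sets of on-bit indices, selects compounds with a coefficient of at least 0.5 to a reference, and derives the active fraction among similar compounds from the counts quoted in the text. The fingerprint type used in the study is not specified here, so the bit sets are toy values and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either set."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similar_to(reference, library, threshold=0.5):
    """Names of library compounds with Tanimoto coefficient >= threshold."""
    return [name for name, bits in library.items()
            if tanimoto(reference, bits) >= threshold]


# Toy fingerprints (hypothetical bit indices, not real compounds).
reference = {1, 4, 7, 9}
library = {
    "identical": {1, 4, 7, 9},      # 4 shared of 4 bits in total
    "close_analog": {1, 4, 7, 12},  # 3 shared of 5 bits in total
    "distant": {2, 3, 9},           # 1 shared of 6 bits in total
}
hits = similar_to(reference, library)

# Counts quoted in the text: compounds with coefficient >= 0.5 and, of those, actives.
similar = {"sitagliptin": 131, "vildagliptin": 273, "saxagliptin": 266,
           "alogliptin": 87, "linagliptin": 76, "teneligliptin": 60}
active = {"sitagliptin": 130, "vildagliptin": 214, "saxagliptin": 192,
          "alogliptin": 86, "linagliptin": 76, "teneligliptin": 60}
active_pct = {drug: round(100.0 * active[drug] / similar[drug], 1) for drug in similar}
print(hits, active_pct)
```

Expressed as percentages, the quoted counts show that nearly all analogs of sitagliptin, alogliptin, linagliptin, and teneligliptin are active, whereas roughly one in four analogs of vildagliptin and saxagliptin is inactive despite high structural similarity.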

Limitations

In exploring the chemical space of DPP4 inhibitors through various means, an issue arises pertaining to the possibility of chemical space bias that may be inherently present in the compiled datasets. It should be noted that compounds were derived from the BindingDB, and although it is assumed to house nearly all (if not all) bioactivity data of DPP4, there is a possibility that some negative results for investigated compound series against DPP4 may not be published, while those that are published are those reporting favorable results for compounds affording nanomolar potency or those that further optimize lead compounds undergoing clinical trials. Bias may arise from medicinal chemists who may have inherent preference for certain chemical scaffolds, which could be attributed to the existence of common chemistry or the use of known fragments commonly found in drugs called privileged structures.63 Thus, great caution should be taken in evaluating the essential functionality giving rise to potent bioactivity.

Conclusion

The search for novel antidiabetic agents has become increasingly important in drug design and development in light of the continual increase in the prevalence of diabetes worldwide. The inhibition of DPP4 is one strategy to combat diabetes. This study reports the large-scale chemical space exploration and QSAR investigation of DPP4 inhibitors. The QSAR model constructed by 13 descriptors provided good predictive performance as represented by an Acc close to 83.0% and a MCC as high as 0.644 for tenfold CV. In addition, a set of descriptors was identified as informative features influencing the predictive performance. The univariate analysis revealed the inherent physicochemical properties and important substructures governing inhibitory activity. The active inhibitors were found to be larger and more charged, polar, flexible, and stable than the inactive inhibitors. Furthermore, the chemical substructure analysis suggested that highly lipophilic aromatic-based and pyrrolidine-based fragments may be essential for DPP4 inhibition. Furthermore, the scaffold analysis revealed piperazine to be a privileged structure affording DPP4 inhibitory activity. Finally, our findings may provide a deeper understanding and pertinent knowledge for the design and development of DPP4 inhibitors. Cumulative variance from PCA analysis of active and inactive DPP4 inhibitors. Abbreviation: PCA, principal component analysis. Heatmap of Tanimoto coefficient on five DPP4 datasets consisting of one internal set and four external validation sets. Tanimoto coefficient varies between 0 (total lack of similarity) to 1 (a compound has an identical constitution to a reference). Important fingerprints of DPP4 inhibitors as ranked by the MDGI. The fingerprint with the largest MDGI value is deemed to be the most important. Abbreviation: MDGI, mean decrease of Gini index. 
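The Tanimoto coefficient named in the heatmap caption can be illustrated on binary fingerprints. This is a minimal sketch assuming each fingerprint is represented as a set of "on" bit indices; the example fingerprints are hypothetical, and the source does not state which fingerprint type or software was used for the heatmap.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

# Hypothetical fingerprints (illustration only)
fp_reference = {1, 4, 7, 9}
fp_partial = {1, 4, 8}
fp_disjoint = {2, 3}

sim_identical = tanimoto(fp_reference, set(fp_reference))
sim_partial = tanimoto(fp_reference, fp_partial)
sim_disjoint = tanimoto(fp_reference, fp_disjoint)
print(sim_identical, sim_partial, sim_disjoint)
```

The three cases span the range described in the caption: identical bit sets give 1, disjoint bit sets give 0, and partial overlap falls in between.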
Table S1

PCA loadings score for active and inactive DPP4 inhibitors

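Descriptor | Active PC1 | Active PC2 | Active PC3 | Inactive PC1 | Inactive PC2 | Inactive PC3
--- | --- | --- | --- | --- | --- | ---
MW | 0.849 | 0.195 | 0.402 | 0.849 | −0.376 | 0.224
RBN | 0.638 | 0.431 | −0.025 | 0.664 | −0.017 | 0.476
nCIC | 0.297 | −0.241 | 0.739 | 0.489 | 0.651 | 0.077
nHDon | 0.578 | −0.231 | −0.204 | 0.341 | 0.605 | 0.289
nHAcc | 0.637 | 0.527 | 0.013 | 0.893 | 0.160 | 0.112
ALogP | 0.061 | 0.267 | 0.564 | 0.164 | 0.813 | 0.203
TPSA | 0.693 | 0.268 | −0.239 | 0.815 | 0.405 | 0.094
Qm | 0.445 | 0.410 | 0.542 | 0.601 | 0.483 | −0.144
Energy | −0.242 | 0.765 | 0.102 | 0.654 | −0.264 | −0.234
Dipole moment | 0.361 | 0.617 | −0.261 | 0.439 | 0.091 | 0.634
HOMO | −0.361 | 0.637 | 0.380 | −0.211 | −0.352 | 0.698
LUMO | −0.585 | 0.732 | 0.073 | −0.559 | 0.344 | 0.671
HOMO–LUMO | 0.575 | 0.452 | −0.413 | −0.463 | 0.608 | 0.258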

Notes: The bold values represent the highest loadings scores at the current PC, compared to other PCs. For instance, MW has a higher loading score of 0.849 at PC1, compared to PC2 (0.195), and PC3 (0.402).

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; PCA, principal component analysis; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Table S2

Contribution value of each descriptor to principal component for active and inactive DPP4 inhibitors

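Descriptor | Active PC1 | Active PC2 | Active PC3 | Inactive PC1 | Inactive PC2 | Inactive PC3
--- | --- | --- | --- | --- | --- | ---
MW | 19.851 | 1.257 | 8.880 | 15.820 | 5.236 | 2.627
RBN | 11.202 | 6.106 | 0.035 | 9.681 | 0.011 | 11.868
nCIC | 2.421 | 1.914 | 30.046 | 5.246 | 15.672 | 0.308
nHDon | 9.214 | 1.751 | 2.302 | 2.544 | 13.523 | 4.380
nHAcc | 11.191 | 9.159 | 0.009 | 17.499 | 0.947 | 0.655
ALogP | 0.101 | 2.341 | 17.513 | 0.589 | 24.441 | 2.160
TPSA(Tot) | 13.235 | 2.371 | 3.139 | 14.558 | 6.069 | 0.466
Qm | 5.461 | 5.526 | 16.146 | 7.933 | 8.633 | 1.091
Energy | 1.612 | 19.254 | 0.569 | 9.382 | 2.579 | 2.876
Dipole | 3.594 | 12.540 | 3.757 | 4.220 | 0.306 | 21.012
HOMO | 3.585 | 13.378 | 7.945 | 0.974 | 4.565 | 25.521
LUMO | 9.419 | 17.664 | 0.291 | 6.851 | 4.380 | 23.563
HOMO–LUMO | 9.114 | 6.738 | 9.369 | 4.701 | 13.639 | 3.472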

Note: The bold values show the highest loadings scores at the current PC, compared to other PCs.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; PC, principal component; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.
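The source does not state how the contribution values in Table S2 were computed. For PC1 of the active set, the tabulated values agree (to rounding) with each descriptor's squared loading from Table S1 expressed as a percentage of the sum of squared loadings on that component. The sketch below illustrates that relationship as an assumption, not as the authors' documented procedure.

```python
# PC1 loadings of the active set, copied from Table S1
loadings_pc1_active = {
    "MW": 0.849, "RBN": 0.638, "nCIC": 0.297, "nHDon": 0.578, "nHAcc": 0.637,
    "ALogP": 0.061, "TPSA": 0.693, "Qm": 0.445, "Energy": -0.242,
    "Dipole moment": 0.361, "HOMO": -0.361, "LUMO": -0.585, "HOMO-LUMO": 0.575,
}

def contributions(loadings):
    """Percent contribution of each descriptor: 100 * loading^2 / sum of loading^2."""
    total = sum(value * value for value in loadings.values())
    return {name: 100.0 * value * value / total for name, value in loadings.items()}

contrib = contributions(loadings_pc1_active)
for name, percent in contrib.items():
    print(f"{name:14s} {percent:7.3f}")
```

Because the loadings are only given to three decimals, the recomputed percentages can differ from the tabulated ones in the last digits.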

Table S3

PCA loadings score for active I and active II DPP4 inhibitors

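Descriptor | Active I PC1 | Active I PC2 | Active I PC3 | Active I PC4 | Active I PC5 | Active II PC1 | Active II PC2 | Active II PC3 | Active II PC4 | Active II PC5
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
MW | 0.834 | 0.241 | 0.412 | −0.002 | −0.047 | 0.338 | 0.750 | −0.026 | 0.458 | 0.154
RBN | 0.629 | 0.473 | −0.001 | −0.166 | 0.319 | 0.410 | 0.344 | −0.493 | 0.266 | −0.192
nCIC | 0.281 | −0.217 | 0.734 | 0.332 | −0.267 | −0.101 | 0.501 | 0.698 | 0.382 | −0.066
nHDon | 0.587 | −0.148 | −0.213 | 0.113 | 0.492 | −0.388 | 0.506 | −0.010 | −0.257 | 0.673
nHAcc | 0.675 | 0.463 | 0.007 | −0.111 | −0.487 | 0.796 | 0.253 | −0.327 | 0.203 | 0.002
ALogP | 0.112 | 0.230 | 0.619 | −0.581 | 0.329 | 0.456 | 0.551 | −0.067 | 0.367 | 0.332
TPSA | 0.698 | 0.257 | −0.257 | 0.488 | 0.055 | 0.309 | 0.797 | 0.063 | −0.322 | −0.089
Qm | 0.448 | 0.383 | 0.540 | 0.243 | 0.023 | 0.449 | 0.460 | −0.439 | 0.472 | −0.097
Energy | −0.284 | 0.743 | 0.104 | 0.389 | 0.184 | 0.815 | 0.183 | 0.408 | −0.076 | −0.174
Dipole moment | 0.515 | 0.602 | −0.285 | −0.235 | −0.101 | 0.634 | 0.035 | −0.443 | 0.182 | −0.196
HOMO | −0.428 | 0.634 | 0.362 | 0.427 | 0.136 | 0.699 | 0.078 | 0.594 | −0.073 | −0.073
LUMO | −0.649 | 0.702 | 0.014 | 0.169 | 0.045 | 0.807 | −0.155 | 0.488 | −0.146 | −0.046
HOMO–LUMO | 0.565 | 0.378 | −0.469 | −0.284 | −0.105 | 0.560 | −0.555 | −0.022 | −0.215 | 0.037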

Note: The bold values show the highest loadings scores at the current PC, compared to other PCs.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; PCA, principal component analysis; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Table S4

Contribution value of each descriptor to principal components for active I and active II DPP4 inhibitors

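Descriptor | Active I PC1 | Active I PC2 | Active I PC3 | Active I PC4 | Active I PC5 | Active II PC1 | Active II PC2 | Active II PC3 | Active II PC4 | Active II PC5
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
MW | 17.639 | 2.080 | 8.743 | 0.000 | 0.259 | 2.787 | 20.257 | 0.035 | 18.959 | 3.269
RBN | 10.046 | 8.042 | 0.000 | 2.121 | 12.118 | 4.100 | 4.257 | 12.198 | 6.411 | 5.086
nCIC | 2.000 | 1.688 | 27.810 | 8.507 | 8.443 | 0.247 | 9.049 | 24.428 | 13.157 | 0.612
nHDon | 8.749 | 0.790 | 2.352 | 0.989 | 28.725 | 3.664 | 9.219 | 0.005 | 5.945 | 62.621
nHAcc | 11.573 | 7.711 | 0.003 | 0.945 | 28.226 | 15.464 | 2.310 | 5.355 | 3.723 | 0.000
ALogP | 0.316 | 1.898 | 19.748 | 26.044 | 12.864 | 5.070 | 10.926 | 0.225 | 12.185 | 15.284
TPSA | 12.370 | 2.369 | 3.408 | 18.385 | 0.356 | 2.336 | 22.921 | 0.201 | 9.384 | 1.108
Qm | 5.091 | 5.278 | 15.070 | 4.540 | 0.065 | 4.910 | 7.636 | 9.651 | 20.154 | 1.291
Energy | 2.045 | 19.829 | 0.563 | 11.688 | 4.003 | 16.193 | 1.202 | 8.359 | 0.518 | 4.212
Dipole moment | 6.731 | 13.051 | 4.200 | 4.271 | 1.203 | 9.800 | 0.044 | 9.868 | 2.981 | 5.291
HOMO | 4.648 | 14.434 | 6.761 | 14.085 | 2.189 | 11.906 | 0.220 | 17.690 | 0.477 | 0.737
LUMO | 10.690 | 17.698 | 0.010 | 2.190 | 0.243 | 15.875 | 0.863 | 11.962 | 1.929 | 0.296
HOMO–LUMO | 8.103 | 5.131 | 11.332 | 6.234 | 1.308 | 7.647 | 11.096 | 0.023 | 4.179 | 0.193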

Note: The bold values show the highest loadings scores at the current PC, compared to other PCs.

Abbreviations: ALogP, Ghose–Crippen octanol–water partition coefficient; HOMO, highest occupied molecular orbital; HOMO–LUMO, energy gap between the HOMO and LUMO states; LUMO, lowest unoccupied molecular orbital; MW, molecular weight; nCIC, number of rings; nHAcc, number of hydrogen bond acceptors; nHDon, number of hydrogen bond donors; PC, principal component; Qm, mean absolute charge; RBN, rotatable bond number; TPSA, topological polar surface area.

Table S5

Summary of molecular framework generated from active DPP4 inhibitors

Number | SMILES | Member size
--- | --- | ---
1 | C(CCC1CCC2CCCC2C1)CC1CCCCC1 | 93
2 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCCCC1 | 87
3 | C(CC1CCCC1)CC1CCC(CC1)C1CCCCC1 | 85
4 | C1CCCC1 | 65
5 | C(CC1CCC(CCC2CCCCC2)CC1)C1CCCC1 | 50
6 | C(C1CCCC1)C1CCCC1 | 43
7 | C(CCCC1CCCCC1)CCC1CCCC1 | 37
8 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C1CCCCC1 | 33
9 | C(C1CCCC1)C1CCC(CC2CCCCC2)C1 | 32
10 | C(CC1CCCC1)CC1CCC(C1)C1CCCCC1 | 32
11 | C(C1CCCC1)C1CCC(CC2CCC(CC2)C2CCCC2)C1 | 31
12 | C(CCC1CCCCCC1)CC1CCCCC1 | 31
13 | C(C1CCCCC1)C1CCCCC1C1CCCCC1 | 30
14 | C(CC1C2CC(CC12)C1CCCCC1)CC1CCCC1 | 29
15 | C1C2CCCCC2C2CCC(CC12)C1CCCCC1 | 27
16 | C(CC1CCCC1)CC1CCCCC1 | 27
17 | C(CCC1CCC(CC2CCCCC2)CC1)CC1CCCCC1 | 27
18 | C(CCC1CCCCC1)CC1CCCC1 | 25
19 | C(CC1CCC(CC1)C1CCCCC1)C1CCCCC1 | 24
20 | C(CCCC1CC1C1CCCCC1)CCCC1CCCC1 | 24
21 | C(C1CCCC1)C1CCC(C1)C1CCCCC1 | 23
22 | C1CCC(CC1)C1CCC2C(CCC3CCCCC23)C1 | 22
23 | C(CCCC1CCCC1)CCCC1CCCCC1 | 22
24 | C1CCC(CC1)C1CCCC2CCCCC12 | 21
25 | C1CCC(CC1)C1CCCCC1 | 21
26 | C(CC1CCCC1)CC1CCC(CC1)C1CCC2CCCC2C1 | 20
27 | C1CCC(CC1)C1CCCC(C1)C1CCCCC1 | 20
28 | C(C1C2CCCCC2CC1C1CCCCC1)C1CCCCC1 | 19
29 | C(CC1CCCC1)CC1CC2CCCC2C1 | 18
30 | C(C1CCC2CC(CC2C1)C1CCCCC1)C1CCCC2CCCCC12 | 18
31 | C1CC2CCC(CC2C1)C1CCCCC1 | 18
32 | C(CC1CCCC1)CC12CC3CC(CC(C3)C1)C2 | 17
33 | C(CC1CCCC1)CC1CCC(CCC2CCCCC2)CC1 | 17
34 | C(CC1CCCCC1)C1CCC(CC2CCCC2)C1 | 17
35 | C(CCC1CCC2CCC(C2C1)C1CCCCC1)CC1CCCCC1 | 16
36 | C(CC1CCC2CC(CC2C1)C1CCCCC1)C1CCCCC1 | 16
37 | C(CC1CCC(CCC2CCCCC2)CC1)C1CCCCC1 | 16
38 | C(C1CCCC1)C1CCC(CC2CC3CCCCC3C2)C1 | 15
39 | C(CC1CCCC1)CC1CCC(CCCC2CCCCC2)CC1 | 15
40 | C(CCC1CCCCC1)CC1CCCC1CCC1CCCCC1 | 15
41 | C(CC1CCCCC1)C1CCCCC1 | 14
42 | C(CC1CCCC1)CC1CCCC1 | 14
43 | C(CCC1CCCCC1)CCC1(CCCCC1)C1CC2CCCCC2C1 | 14
44 | C(C1CCCCC1)C1CC2CCCCC2CC1C1CCCCC1 | 13
45 | C(CC1CCC(CCC2CCCCC2)C1)C1CCCC1 | 13
46 | C(CCC1CCCC(CC2CCCCC2)CC1)CC1CCCCC1 | 13
47 | C(CCC1CCCCC1)CC1CCCC1C1CCC(C1)C1CCCCC1 | 13
48 | C(CCC1CCC2CCCC2C1CC1CCCCC1)CC1CCCCC1 | 13
49 | C1CCC(CC1)C1CCC(CC1)C1CCCCC1 | 13
50 | C(CC1CCCCC1)C1CCCC1 | 12
51 | C(CCC1CCCC1)CCC1CCC(C1)C1CCCCC1 | 12
52 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C1CC2CCCCC2C1 | 11
53 | C(CCC1CCCCC1)CC1CCCCC1 | 11
54 | C1CC2CCC(CC2C1)C1CCC(CC1)C1CCCCC1 | 11
55 | C(CCC1CCC2CC(CC2C1)C1CCCCC1)CC1CCCCC1 | 11
56 | C(CCCCC1CCCCC1)CCCC1CCCC1 | 11
57 | C(C(CC1CCCCC1)C1CCCC1CC1CCCC1)C1CCCCC1 | 10
58 | C(C1CCC2CC(CC2C1)C1CCCCC1)C1CCC2CCCCC2C1 | 10
59 | C(CCC1CCC(CCC2CCCC2)CC1)CC1CCCCC1 | 10
60 | C(CCCC1CCCC1)CCC1CCCC1 | 10
61 | C(CCC1CCCC1)CCC1CCC2CCCC2C1 | 10
62 | C(CCC1CCCCC1CC1CCCCC1)CC1CCCCC1 | 10
63 | C(CCCC1CCCC1)CCCC1CCC2CCCC2C1 | 10
64 | C1CCC(CC1)C1CCC(CC1)C1CCC2CCCCC2C1 | 9
65 | C(CCCCC1CCCC1)CCCCC1CCCCC1 | 9
66 | C(CCC1CCCC1)CCC1CCCC1 | 9
67 | C(CC1CCC(CCC2CCCC2)C1)CC1CCCCC1 | 8
68 | C(C1CCCC1)C1CCC(CC2CCC3CCCCC3C2)C1 | 8
69 | C(CC1CCCC1)CC1CCC(CC1)C1CCCC(C1)C1CCCC1 | 8
70 | C(CCC1CCCCC1)CC1CCCC1CCCC1CC1 | 8
71 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C1CCCC2CCCCC12 | 8
72 | C(CC1CCCCC1C1CCCCC1)C1CCCC1 | 8
73 | C(CCCC1CCC2CCCCC2C1)CCC1CCCC1 | 8
74 | C(CCC1CCC(CC1)C1CCCCC1)CC1CCCCC1 | 8
75 | C(C1CCCCC1)C1CC2CCCC2CC1C1CCCCC1 | 8
76 | C(CC1CCCC1)CC1CCC(CC2CCCCC2)CC1 | 8
77 | C1CCC(C1)C1CCC2C(CCC3CCCCC23)C1 | 7
78 | C(CCC1CCCC1)CCC1CCCCC1 | 7
79 | C(C1CCC2CC(CC2C1)C1CCCCCC1)C1CCC2CCCCC2C1 | 7
80 | C(CCC1CCCCCC1CC1CCCCC1)CC1CCCCC1 | 7
81 | C(CCC1CCC2CCCCC2C1)CC1CCCC1 | 7
82 | C(CC1CCCCC1)CC1CCCCC1C1CCCCC1 | 7
83 | C(C1CCCC1)C1CCC(C1)C1CCC2CCCCC12 | 7
84 | C1CC(CC1C1CCCCC1)C1CCCC(C1)C1CCCCC1 | 6
85 | C(CCC1CCC(CCCC2CCCCC2)CC1CC1CCCCC1)CC1CCCCC1 | 6
86 | C(CC1CCCCC1)CC1CCCC(CCC2CCCC2)C1 | 6
87 | C(CCC1CCCCC1)CC1CCCC1C1CCC(C1)C1CC1 | 6
88 | C(C1CCC2CC(CC2C1)C1CCCCCC1)C1CCCC2CCCCC12 | 6
89 | C(CC1CCCC1CC1CCCC1)CC12CC3CC(CC(C3)C1)C2 | 6
90 | C(CC1CCCC1)C(CC1CC1)C1CCC(C1)C1CCCCC1 | 6
91 | C(CCC1CCC2CCC(C3CCCC3)C2C1)CC1CCCCC1 | 5
92 | C(CCC1CCC2CC(CC3CCCCC3)CC2C1)CC1CCCCC1 | 5
93 | C1CCC(C1)C1CCCC(C1)C1CCC(C1)C1CCCCC1 | 5
94 | C1CC2CCCC2C1 | 5
95 | C(CC1CCC(C1)C1CCCCC1)CC1CCCC2CCCCC12 | 5
96 | C1C(CC2CCCCC12)C1CCCCC1 | 5
97 | C(CCC1CCC2CC(CC2C1)C1CC1)CC1CCCCC1 | 5
98 | C(CCC1CCCCC1)CC1CCCC1CC1CCCCC1 | 4
99 | C(CC1CCCC1)CC1CCC(CC1)C1CCCC2CCCC12 | 4
100 | C(CC1CCC(CC2CCCC2)C1)CC1CCCCC1 | 4
101 | C1CC1C1CCCC2CCC(CC12)C1CCC(C1)C1CCCCC1 | 4
102 | C(CCC1CCC2CCCC2C1CC1CCCC1)CC1CCCCC1 | 4
103 | C(CC1CCC1)CC1CCC(CC1)C1CCCCC1 | 4
104 | C(C1CCCCC1)C1CC(CCC1C1CCCCC1)C1CCCC1 | 4
105 | C(CC1CCC(CC2CCCCC2)CC1)C1CCCCC1 | 4
106 | C(CC1CCCC1)CC1C2CC3CC(C2)CC1C3 | 4
107 | C(C1CCCCC1)C1CC(CCC1C1CCCCC1)C1CCCCC1 | 4
108 | C(CCC1CCC2C(CCC2C2CCCCC2)C1)CC1CCCCC1 | 4
109 | C(CCC1CCC(CC2CC3CCCCC3C2)CC1)CC1CCCCC1 | 4
110 | C(CCCC1CCCCC1)CCC1CCCCC1 | 4
111 | C(CC1CCC2CC(C(CC3CCCCC3)C2C1)C1CCCCC1)C1CCCCC1 | 4
112 | C1CCC(CC1)C1CCC2CCCC(C3CCCCC3)C2C1 | 4
113 | C(CC1CCC(CCC2CCCC2)C1)CC1CCCC2CCCCC12 | 4
114 | C(CC1CCCCC1)CC1CCC(CCC2CCCC2)CC1 | 4
115 | C(CCC1CCCC1)CCC1CC2CCCCC2C1 | 4
116 | C(CC1CCC(C(CC2CCCCC2)C1)C1CCCCC1)C1CCCCC1 | 3
117 | C(CCC1CC2CCCCC2C1)CC1CCCC1 | 3
118 | C(C1CCC1)C1CCC(C1)C1CCC(CC1)C1CCCCC1 | 3
119 | C(CCC1CCC(CC2CCC3CCCCC3C2)CC1)CC1CCCCC1 | 3
120 | C1CC2CC(CC2C1)C1CCCCC1 | 3
121 | C(CCC1CCCC1)CCC1CCC2CCCCC2C1 | 3
122 | C(CCC1CCCCC1)CCC1(CCCCC1)C1CCCCC1 | 3
123 | C(CCC1CCCCC1)CC1CCCC1C1CCCC1 | 3
124 | C(C1CCCCC1)C1CCC2CC(CC2C1)C1CCCCC1 | 3
125 | C(CC1CCCCC1)CC1CCC(CC1)C1CCCCC1 | 3
126 | C(CCC1CCCCC1)CC1CCCC1C1CCC(C1)C1CCC1 | 3
127 | C(CC1CCCC1)CC12CC3CC(C1)CC(CCCC1CCCCC1)(C3)C2 | 3
128 | C(CC1CCCC1)C1CCCC1 | 3
129 | C(CC1CC1)CC1CCCC1 | 3
130 | C(CC1CCC(CCC2CCCC3CCCCC23)CC1)C1CCCC1 | 3
131 | C(CC1CCC(CCC2CCC3CCCCC3C2)CC1)C1CCCC1 | 3
132 | C | 3
133 | C(CC1CCC1)CC1CCCC1 | 3
134 | C(CC1CCCC1)CC1CCCCCCC1 | 3
135 | C(CCCC1CC2CCCCC2C1)CCC1CCCC1 | 3
136 | C(C1CCCC1)C1CCCC(C1)C1CCCCC1 | 2
137 | C(CCC1CCCC(CC1)C1CCCCC1)CC1CCCCC1 | 2
138 | C(C1CC1)C1CCCC(C1)C1CCC(C1)C1CCCCC1 | 2
139 | C(CCCC1CCCC1)CCCC1CCC2CC(CC2C1)C1CCCCC1 | 2
140 | C(CC1CCCC(C1)C1CCCCC1)C1CC1 | 2
141 | C(CCC1CCCC(CC1)C1CC1)CC1CCCCC1 | 2
142 | C(CC1C2CC(CC12)C1CC2CCCCC2C1)CC1CCCC1 | 2
143 | C(CC1CCC(C1)C1CCCCC1)CC1CCCCC1 | 2
144 | C1CCC(C1)C1CCC2CCCC(C3CCCCC3)C2C1 | 2
145 | C(CCCC1CCCC1)CCCC1CCCC1 | 2
146 | C(CCCCCC1CCCCC1)CCCCC1CCCC1 | 2
147 | C(CC1CCCC1)CC1(CCC2CCCCC2)CCCC1 | 2
148 | C(CC1CCC(CCC2CCCC3CCCCC23)C1)C1CCCC1 | 2
149 | C(CCC1CCC(CC2CCC(CC2)C2CCCCC2)CC1)CC1CCCCC1 | 2
150 | C(CCC1CCCC1)CC1CCCC1 | 2
151 | C(CC1CCCC1)CC1CCCCC1C1CCCCC1 | 2
152 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCCC1 | 2
153 | C(C1CC2CCCC2C1)C1CCCCC1 | 2
154 | C1CC2CCCCCCCCCCCCCCC3CCCC(C3)CCC2C1 | 2
155 | C(CC1CCCC1)CC1CC2CCCCC2C1 | 2
156 | C(CC1CC1)CC1CCC(CCCC2CCCC2)CC1 | 2
157 | C(CC1CCCC1)CC1CCC(CCCC2CCCC2)CC1 | 2
158 | C(CC1CCCC1)CC1CC2CC(CC3CCCCC3)CC2C1 | 2
159 | C1CCC2CCCCC2C1 | 2
160 | C(CC1CCC(CC1)C1CCC2CCCC2C1)C1CCCCC1 | 2
161 | C(CC1CCC(CC1)C1CCCC1)C1CCCCC1 | 2
162 | C1CC2CCC(CC2C1)C1CCCC(C1)C1CCC(C1)C1CCCCC1 | 2
163 | C(CC1CCC(CCC2CCC(CC2)C2CCCC2)CC1)C1CCCCC1 | 2
164 | C(CC1CCCC1)CC1CCCCCC1 | 2
165 | C(C1CCCCC1)C1CCCC(C1)C1CCCCC1 | 2
166 | C(CC1CCCC1)CC1CC2CCC1C2 | 2
167 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCC2CCCC2C1 | 2
168 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCC2CCCCC2C1 | 2
169 | C(C1CCCC1)C1CCCCC1 | 2
170 | C(C1CCCC1)C1CCC(C1)C1CCCC2CCCCC12 | 2
171 | C(CCC1CCC2CCCC2C1CC1CC1)CC1CCCCC1 | 2
172 | C(C1CCCC1)C1CCC(CC2CCCC2)C1 | 2
173 | C(CCC1CCCC1CC1CCCC1)CCC1CCCCC1 | 2
174 | C(C1CCC1)C1CCC(CC2CCCC2)C1 | 2
175 | C(CCCC1CCCC1)CCCC1CCC2CCCCC2C1 | 2
176 | C(CCC1CCC2C(CCC2C2CC2)C1)CC1CCCCC1 | 2
177 | C(C1CCC2CCCC2C1)C1CCC2CCCCC2C1 | 2
178 | C(C1CCC2CC(CC2C1)C1CCCCC1)C1CC2CCCCC2CC2CCCCC12 | 2
179 | C1C(CC2CCCCC12)C1CCC(CC1)C1CCCCC1 | 2
180 | C(C(CC1CCCCC1)C1CCC(CC2CCCC2)C1)C1CCCCC1 | 2
181 | C(C1C2CC(CC3CCCC4CCCCC34)CCC2CC1C1CCCCC1)C1CCCCC1 | 2
182 | C(CC1CCCC1)CC1CCC(CC1)C1CCCC(C1)C1CCCCC1 | 1
183 | C(C1CCC1)C1CCCC1 | 1
184 | C(C1CCCCC1)C1CC2CC(CC2CC1C1CCCCC1)C1CCCCC1 | 1
185 | C(CC1CCC(CC1)C1CCC(CC1)C1CCC(CC1)C1CCCCC1)C1CCCCC1 | 1
186 | C(C1CCCCC1)C1CCCCC1CC1CCCCCC1 | 1
187 | C(C1CCCC1)C1CCCCC1C1CCCCC1 | 1
188 | C(CC1CCC(CC2CCCCC2)C1)C1CCCC1 | 1
189 | C(CC1CCCC1)CC1CCC(CCC2CC2)CC1 | 1
190 | C(C1CCCCC1)C1CC2CCC(CC2CC1C1CCCCC1)C1CCCCC1 | 1
191 | C(C1CCCCC1)C1CCC(CC1C1CCCCC1)C1CCCCC1 | 1
192 | C(CCCC1CCCCCC1)CCC1CCCC1 | 1
193 | C(CCCC1C2CCC1CC2)CCC1CCCC1 | 1
194 | C(CCC1CCC(CC2CCCC(C2)C2CCCCC2)CC1)CC1CCCCC1 | 1
195 | C(CCCC1CCCC1)CCC1CC1C1CCCCC1 | 1
196 | C(CCC1CCC(CC1)C(C1CCCCC1)C1CCCCC1)CC1CCCCC1 | 1
197 | C(CC1CCCC1)CC1CCC(CCCC2CCC3CCCCC3C2)CC1 | 1
198 | C(CCC1CCC(CC1)C1CCCCC1C1CCCCC1)CC1CCCCC1 | 1
199 | C(CC1CCCC1)CC1CCC(CCCC2CC3CCCCC3C2)CC1 | 1
200 | C(CC1CCCC1)CC1CC2CC(CC3CCCC3)CC2C1 | 1
201 | C1CC2CCCCCCCCCCCCCCCCCCCCC3CCCC(C3)CCC2C1 | 1
202 | C(CCCC1CCC(CC1)C1CCCCC1)CCC1CCCC1 | 1
203 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C1CCC2CCCCC2C1 | 1
204 | C(CCC1CCC(CC2CCCC2)CC1)CC1CCCCC1 | 1
205 | C(CC1CCCC1)CC1CC2CC(CCCC3CCCCC3)CC2C1 | 1
206 | C(CCCC1CCC(CC1)C1CC2CCCCC2C1)CCC1CCCC1 | 1
207 | C(CCC1CCCCC1)CCC1CCC(CC(CC2CCC(CCCCCC3CCCCC3)CC2)C2CCCC2CC2CCCC2)CC1 | 1
208 | C(CC1CC2CCCCC2C1)C1CCC(CC2CCCC2)C1 | 1
209 | C(C1CC2CCCCC2C1)C1CCC2CC(CC2C1)C1CCCCC1 | 1
210 | C(CC1CCCCC1)C1CCCC1CC1CCCC1 | 1
211 | C(C1CCCC1)C1CCCC1CC1CC2CCCCC2C1 | 1
212 | C(C1CCCC1)C1CCC(CC2CCC(CC2)C2CCC(C2)C2CCCCC2)C1 | 1
213 | C(CCC1CCC(CC2CCC3CCCCC23)CC1)CC1CCCCC1 | 1
214 | C(CC1CCCCC1)C1CCCC1CC1CC2CCCCC2C1 | 1
215 | C(CCC1CCCC1CCC1CCCCC1)CC1CCCC1 | 1
216 | C(CC1CCCCC1)C1CCC1CC1CCCC1 | 1
217 | C(CC1CCCCC1)C1CCC1CC1CC2CCCCC2C1 | 1
218 | C(CCC1CCC(CCC2CCCCC2)CC1)CC1CCCCC1 | 1
219 | C(C1C2CC(CC3CCCC4CCCCC34)CCC2CC1C1CCCCCC1)C1CCCCC1 | 1
220 | C(CC1CCC2CC(CC2C1)C1CCCCCC1)C1CCCCC1 | 1
221 | C(CCCC1CCC(CC2CCCCC2)CC1)CCC1CCCC1 | 1
222 | C(CCC1CCCCCC1CC1CCCC1)CC1CCCCC1 | 1
223 | C(CC1CCCC1)CC1CCC(CC1)C1CCCC(CC2CC2)C1 | 1
224 | C(C1CCCC1)C1CCC(CC2CCCC(CC2)C2CCCC2)C1 | 1
225 | C(C1CC2CCCCC2C1)C1CCC2CC(CC2C1)C1CCCCCC1 | 1
226 | C(CCC1CCCCC1CCC1CCCC1)CC1CCCCC1 | 1
227 | C(CC1CCC(CCC2CCCCC2)CC1)C1CCC2CCCCC12 | 1
228 | C(CC1CCC(CC1)C(CC1CCCC1)CC1CCCCC1)C1CCCCC1 | 1
229 | C1C(CC2CC(CCC12)C1CCCCC1)C1CCCCC1 | 1
230 | C1CCC(C1)C1CCC(C1)C1CCCC(C1)C1CCCCC1 | 1
231 | C(C1CCCC1)C1CCC2CC(CCC2C1)C1CCCC(C1)C1CCCC1 | 1
232 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCC(CCCC2CCCCC2)CC1 | 1
233 | C(C1CCC2CCC(CC12)C1CCCCC1)C1CCCCC1 | 1
234 | C(CCC1CCCC2CCCC12)CC1CCCCC1 | 1
235 | C(CC1CCC1)CC1CCC(CC1)C1CCC2CCCC2C1 | 1
236 | C(CC1CCCC(C1)C1CCCCC1)C1CCCC1 | 1
237 | C(C1CC2CCCCC2C1)C1CCCC1CC1CC2CCCCC2C1 | 1
238 | C(CCC1CCCC1CCCC1CCCCC1)CC1CCCC1 | 1
239 | C(CCC1CCCC1)CC(CCC1CCCC1)CC1CCCCC1 | 1
240 | C(CC1CCCC1)C1CC1 | 1
241 | C1CCC(C1)C1CCC(C1)C1CCCC(C1)C1CCC2CCCC2C1 | 1
242 | C(C1CCCC1)C1CC2CCCCC2C1 | 1
243 | C(CCC1CCCC(CCCCC2CCCCC2)CC1)CC1CCCCC1 | 1
244 | C(CCCC1CCCC1CCCCC1CCCCC1)CCC1CCCC1 | 1
245 | C(CCC1CCC(CC2CCCCC2)CC1)CC1CCCC1 | 1
246 | C(C1CCCC1)C1CCC(CC2CCC3CCCC3C2)C1 | 1
247 | C(CCC1CCCCC1CC1CCCC1)CC1CCCCC1 | 1
248 | C(CCC1CCC(CC2CCCCC2)CC1CC1CCCCC1)CC1CCCCC1 | 1
249 | C(CC12CC3CC(CC(C3)C1)C2)C1CCC2CC12 | 1
250 | C(CCC1CCC(CC1)C1CCCCC1)CC1CCCC1 | 1
251 | C(CCC(CC1CCCCC1)C1CCCCC1)CCC1CCCC1 | 1
252 | C(CCC1CCC2CCCC2C1C1CC1)CC1CCCCC1 | 1
253 | C(CCC1CCC2C(CC3CCCCC3)CCC2C1)CC1CCCCC1 | 1
254 | C(CCC1CCC2CCC(CC3CCCCC3)C2C1)CC1CCCCC1 | 1
255 | C(CC1CCCC1)C(CC1CCC1)C1CCC(CC1)C1CCCCC1 | 1
256 | C(CCCC(C1CCCCC1)C1CCCCC1)CCCC1CCCC1 | 1
257 | C(CC1CCCC1)C(CCC1CCCC1)C1CCC(CC1)C1CCCCC1 | 1
258 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C1CCC2CCCCC12 | 1
259 | C(CCCC1CC1C1CCCC1)CCCC1CCCC1 | 1
260 | C1CCC(C1)C1CCC2CCCCC12 | 1
261 | C(C1CCCC1)C1CCC2CCCCC2C1 | 1
262 | C(CCC1CCCCC1C1CCCCC1)CC1CCCCC1 | 1
263 | C(CC1CCCC1)C(CC1CCCC1)C1CCC(CC1)C1CCCCC1 | 1
264 | C1C2CCCCC2C2CCCC(C12)C1CCCCC1 | 1
265 | C(CCC(CC1CCCCC1)CC1CCCCC1)CCC1CCCC1 | 1
266 | C(C1CCCC1)C1CCC(C1)C1CCCC1 | 1
267 | C(C1CCCC1)C1CCC(C1)C1CCC(CC1)C(C1CCCCC1)C1CCCCC1 | 1
268 | C(C1CCCC1)C1CCC(C1)C1CCCC(CC1)C1CCCCC1 | 1
269 | C(CC1CCCC1)CC1(CCC1)C1CCCCC1 | 1
270 | C(C1CCCC1)C1CCC(C1)C1CCC(CC2CCCCC2)CC1 | 1
271 | C(CCC1CCC2CCCCC2C1)CC1CCCCC1 | 1
272 | C(CCCC1CCCC1)CCCC1CCC(CC1)C1CCCCC1 | 1
273 | C(C1CCCC1)C1CCC(C1)C1CC2CCCCC2C1 | 1
274 | C(C1CCCC1)C1CCC(C1)C1C2CC3CC(C2)CC1C3 | 1
275 | C(C1CCCC1)C1CCC(CC2CCCCCCC2)C1 | 1
276 | C(C1CCCC1)C1CCC(CC2CCCCCC2)C1 | 1
277 | C1CC(C2CC(CCC12)C1CCC(CC1)C1CCCCC1)C1CCCCC1 | 1
278 | C(CC1CCCC1)CC1CCCCCCCCC1 | 1
279 | C1CC1C1CCCC2CCC(CC12)C1CCC(CC1)C1CCCCC1 | 1
280 | C(CCC1CCC2CCCCC2C1CC1CCCCC1)CC1CCCC1 | 1
281 | C1CC1C1CC2CCC(CC2C1)C1CCC(CC1)C1CCCCC1 | 1
282 | C(CC1CCCC1)CC12CCC(CC1)CC2 | 1
283 | C1CCC2C(C1)CCC1CCCCC21 | 1
284 | C(CC1CCCC1)CC1CCC(CC1)C1CCCC(C1)C1CCC(C1)C1CC1 | 1
285 | C(CC1CCCC1)CC12CC3CC1CC(C2)C3 | 1
286 | C(CC1CCCC1)CC1CC2CC1CCC2 | 1
287 | C1CC1C1CCC2CC(CC2C1)C1CCC(CC1)C1CCCCC1 | 1
288 | C(CC1CC1)C(CCC1CCCC1)C1CCC(CC1)C1CCCCC1 | 1
289 | C(CC1CCCCC1)C1CC2CCCC2C1 | 1
290 | C1CCC(C1)C1CC2CCCC2C1 | 1
291 | C(CC1CCCC1)C(C1CCCC1)C1CCC(CC1)C1CCCCC1 | 1
292 | C(CCC1CCCCC1)CC1CCCC1CCCC1CCC1 | 1
293 | C(CCC1CCCCC1)CC1CCCC1CCC1CC1 | 1
294 | C(CC1CCCC1)CC1CC2CC(C2)C1 | 1
295 | C(CC1CCC(CC2CCC3CCCCC23)CC1)C1CCCCC1 | 1
296 | C(CC1CCCC1)CC1CCCCCCCCCCC1 | 1
297 | C(CC1CCC(CC2CCC3CCCC3C2)CC1)C1CCCCC1 | 1
298 | C(CC1CCC(CC2CCC3CCCCC3C2)CC1)C1CCCCC1 | 1
299 | C(CCC1CC2CC(C2)C1)CC1CCCC1 | 1
300 | C(CC1CCCC1)CC1CCC(CCCC23CC4CC(CC(C4)C2)C3)CC1 | 1
301 | C1CC1C1CCCC(C1)C1CCC2C(CCC3CCCCC23)C1 | 1
302 | C(CCC1CCCC1)CCC1CCC(CCCC2CCCC2)CC1 | 1
303 | C(CC1CCCC1)CC1CCC(CCCC2CCC3CCCC3C2)CC1 | 1
304 | C(CCC1CCC(CCCC2CCCC2)CC1)CC1CCCCC1 | 1
305 | C1CCC2C(C1)CCC1CC(CCC21)C1CCCCCC1 | 1
306 | C(CCC12CC3CC(CC(C3)C1)C2)CC1CCCC1 | 1
307 | C(CC1CCCC1)CC1CCC(CC2CC3CCCCC3C2)CC1 | 1
308 | C(CCC1CCCCC1)CC1CCCC1C1CCC(CC2CC2)C1 | 1
309 | C(CC1CCCC1)CC1CCC(CC1)C1CC2CCCCC2C1 | 1
310 | C(CCC1CCCCC1)CC1CCCC1C1CCC(C1)C1CCCC1 | 1
311 | C(CCC1CCC2CCCCC2C1)CC1CCC(CCC2CCCC2)CC1 | 1
312 | C(CCC1CCCC2CCCCC12)CC1CCC(CCC2CCCC2)CC1 | 1
313 | C(CCC1CCCC1)CCC1CCC(CC1)C1CCCCC1 | 1
314 | C(CCC1CCCC1)CCC1CCC2CC(CC2C1)C1CCCCC1 | 1
315 | C(CCC(C1CCCCC1)C1CCCCC1)CCC1CCCC1 | 1
316 | C(CCC1CCCC(CC2CCCC2)CC1)CC1CCCCC1 | 1
317 | C(CC1C2CC(CC12)C1CCC2CCCCC2C1)CC1CCCC1 | 1
318 | C(CC1C2CC(CC12)C1CCCC1)CC1CCCC1 | 1
319 | C(CCC1CCCC(CC2CCC3CCCC3C2)CC1)CC1CCCCC1 | 1
320 | C(CCC1CCCC(CCC2CCCCC2)CC1)CC1CCCCC1 | 1
321 | C(CCC1CCCC(CCCC2CCCCC2)CC1)CC1CCCCC1 | 1
322 | C(CCC1CCCC2CCCCC12)CC1CCCC1 | 1
323 | C(CCC1CCCCC1)CCC1(CCCCC1)C1CCCC1 | 1
324 | C(CCC1CCCCC1)CCC1(CCCCC1)C1CC2CCC(CCCC3CCCCC3)CC2C1 | 1
325 | C(CC1CCC2CC(CC2C1)C1CCC2CCCC12)C1CCCCC1 | 1
326 | C(CC1CCCC1)C(C1CCCCC1)C1CCCCC1 | 1
327 | C(CCC1CCCCC1)CCC1(CCC(CC2CCCCC2)CC1)C1CCCCC1 | 1
328 | C1CCCCCC1 | 1
329 | C1CCCCC1 | 1
330 | C(CC1CCC2CCCC2C1)C1CCCC1 | 1
331 | C(C1CCCC1)C1CC2CCC1C2 | 1
332 | C(CCC1CCCC1)CCC1CCC2CCCCC12 | 1

Abbreviation: SMILES, simplified molecular-input line-entry system.
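A framework summary of the kind shown in Table S5 can be produced by counting how many compounds share each framework. The sketch below assumes that "member size" means the number of compounds mapped to a framework and that the framework SMILES have already been generated by a cheminformatics toolkit; the function name `framework_summary` and the input list are hypothetical.

```python
from collections import Counter

def framework_summary(frameworks):
    """Rank frameworks by member size (descending); ties are broken alphabetically."""
    counts = Counter(frameworks)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, smiles, size) for rank, (smiles, size) in enumerate(ranked, start=1)]

# Hypothetical per-compound framework SMILES (illustration only)
compound_frameworks = [
    "C1CCCC1", "C1CCCCC1", "C1CCCC1", "C(C1CCCC1)C1CCCC1", "C1CCCC1",
]

summary = framework_summary(compound_frameworks)
for rank, smiles, size in summary:
    print(rank, smiles, size)
```

The same counting applied separately to the active and inactive sets would give two tables that can be compared framework by framework.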

Table S6

Summary of molecular framework generated from inactive DPP4 inhibitors

Number | SMILES | Member size
--- | --- | ---
1 | C1CCCC1 | 43
2 | C1CCCCC1 | 32
3 | C(CCCC1CCCCC1)CCC1CCCCC1 | 29
4 | C(CCC1CCCCC1)CC1CCCC1 | 24
5 | C1C(CC2CCCCC12)C1CCCCC1 | 20
6 | C(CC1CCCC1)CC1CCCCC1 | 20
7 | C1CC2CCCCC2C1 | 18
8 | C1CCC2CCCCC2C1 | 15
9 | C1CCC(CC1)C1CCCC(C1)C1CCCCC1 | 14
10 | C1CCC(CC1)C1CCCCC1 | 12
11 | C(CC1CC2CCCCC2C1)C1CCC(CC2CCCC2)C1 | 12
12 | C(CCC1CCC(CC1)C(C1CCCCC1)C1CCCCC1)CC1CC2CCCCC2C1 | 9
13 | C(CCC1CCCC2CCCCC12)CC1CCCC1 | 9
14 | C(CC1CCCCC1)CC1CCCCC1 | 8
15 | C | 8
16 | C(CC1CCCC1)CC12CC3CC(CC(C3)C1)C2 | 8
17 | C(CCC1CCCCC1)CC1CCCCC1 | 8
18 | C(C(CC1CCCCC1)C1CCCC1)C1CCCCC1 | 7
19 | C1CCC(CC1)C1CCCC2CCCCC12 | 7
20 | C(C1CCCC1)C1CCCC1 | 6
21 | C(CC1CCCCC1)C1CCCC1 | 6
22 | C(C(CC1CCCCC1)C1CCC2CCCCC12)C1CCCCC1 | 6
23 | C(CCC1CCCCC1)CCC1CCCCC1 | 6
24 | C(CCC1CCCC1)CCC1CCCCC1 | 6
25 | C(C1CCCCC1)C1CCCC(C1)C1CCCCC1 | 6
26 | C(C1CCCC1)C1CCCC(C1)C1CCCCC1 | 6
27 | C(CC1CCCCC1)C1CCCCC1 | 5
28 | C1CCC(CC1)C1CCC2CCCCC2C1 | 5
29 | C(CC1CCCC1)CC1CCCC1 | 4
30 | C(C(CC1CCCCC1)C1CCCC1CC1CCCC1)C1CCCCC1 | 4
31 | C(C1CCCCC1)C1CCCCC1 | 4
32 | C(CCC1CCC2CCCC2C1)CC1CCCCC1 | 4
33 | C(CC1CCC(CC1)C1CCCCC1)C1CCC(CC2CCCC2)C1 | 3
34 | C(CC1CCCCC1)CC1CCCC(C1)C(CCC1CCCCC1)C1CCCCC1 | 3
35 | C(CCC1CCC(CC2CCCCC2)CC1)CC1CC2CCCCC2C1 | 3
36 | C(CCCC1CCCCC1)CCC1CCCC1 | 3
37 | C1CCCCCC1 | 3
38 | C(CC1CCCCC1)C(C1CCCCC1)C1CCCCC1CCC1CCCCC1 | 3
39 | C(CCC1CCC(CCCC2CCCCC2)CC1CC1CCCCC1)CC1CCCCC1 | 3
40 | C(CC1CCCC(C1)C1CCCCC1)C1CC1 | 3
41 | C1CCC1 | 3
42 | C(CC1CCC2CCCC2C1)C1CCC(CC2CCCC2)C1 | 3
43 | C(CC1CCCC1)CC1CCC(C1)C1CCCCC1 | 3
44 | C(CCC1CCCCC1)CC1CCCC1CC1CCCC1 | 3
45 | C(CCC1CCC2CCCC2C1)CC1CCCC1 | 3
46 | C(CC1CCC2CCCCC2C1)C1CCCCC1 | 3
47 | C(CCC1CCCC1)CC1CCCC1 | 2
48 | C(CCC(C1CCCCC1)C1CCCCC1)CCC1CC2CCCCC2C1 | 2
49 | C1C2CC(CCC2C2CCC3C(CCC4CCCCC34)C12)C1CCCCC1 | 2
50 | C1CCC(CC1)C1CCC2C(CCC3CCCCC23)C1 | 2
51 | C(CC1CCCCC1)C1CCC(CC2CC3CCCCC3C2)C1 | 2
52 | C1CC2CCC(CC2C1)C1CCCCC1 | 2
53 | C(CCCC1CCCCC1)CCC1CC2CCCCC2C1C(CC1CCCCC1)CC1CCCCC1 | 2
54 | C(C1CCC(CC2CCC3CCCCC3C2)C1)C1CC2CCCCC2C1 | 2
55 | C(CC1CCCC2CCCCC12)C1CCC(CC2CC3CCCCC3C2)C1 | 2
56 | C(CCC1CCCC1)CCC1CCC(C1)C1CCCCC1 | 2
57 | C(CC1CCCCC1)CC1CCC(CC2CCCCC2)CC1 | 2
58 | C(CC1CCCC1CC1CCCC1)CC1CCCCC1 | 2
59 | C(CCC1CCC2CCCCC2C1)CC1CC2CCCCC2C1 | 2
60 | C(CC1CCCCC1)CC12CC3CC(CC(C3)C1)C2 | 2
61 | C(CCC1CCCCC1)CC(CC1CCCCC1)CC1CCCCC1 | 2
62 | C(CCC1CCC(CC2CCC3CCCCC3C2)CC1)CC1CC2CCCCC2C1 | 1
63 | C(CCC1CCC(CC2CCC(CC2)C2CCCCC2)CC1)CC1CC2CCCCC2C1 | 1
64 | C(CCC1CCC(CC2CCCC(C2)C2CCCCC2)CC1)CC1CC2CCCCC2C1 | 1
65 | C(CCC1CC2CC1CC2C(C1CCCCC1)C1CCCCC1)CC1CC2CCCCC2C1 | 1
66 | C(CCC1CCCC(CC1)C(C1CCCCC1)C1CCCCC1)CC1CC2CCCCC2C1 | 1
67 | C(CCC1CCC(CC(C2CCCCC2)C2CCCCC2)CC1)CC1CC2CCCCC2C1 | 1
68 | C(CC1CCCCC1)C1CC2CCCCC2C1 | 1
69 | C(CCC1CCC(CC2C3CCCCC3C3CCCCC23)CC1)CC1CC2CCCCC2C1 | 1
70 | C(CCC1CCC(CC1)C1CCCCC1)CC1CC2CCCCC2C1 | 1
71 | C(CC1CCC2C1CCC1C2CCC2CCCCC12)CC1CCCCC1 | 1
72 | C(CC1CCC(CC2CCCC2)C1)CC1CCCCC1 | 1
73 | C(CC1CCCC1)CC1CCC(CC2CCCCC2)CC1 | 1
74 | C(CC1CCCCC1)C1CCC(CC2CCCC2)C1 | 1
75 | C(CC1CCCCC1)CC1CCC(CCCC2CCCCC2)CC1 | 1
76 | C(CC1CCC(CC2CCCC2)C1)C1CCCC1 | 1
77 | C1CC2CCC3C(CCC4CCCCC34)C2C1 | 1
78 | C(CC1CCCCC1)C1CC2CCC3C(CCC4CCCCC34)C2C1 | 1
79 | C(CCC1CCCC(C1)C1CCCCC1)CC1CCCCC1 | 1
80 | C(CC1CCCCC1)CC1CCCC(C1)C1CCCCC1 | 1
81 | C(CC1CCCC(C1)C1CCCCC1)C1CCCC1 | 1
82 | C(CCC1CCC(CC1)C1CCCC1)CC1CCCC1 | 1
83 | C(CCC1CCCC(C1)C1CCCCC1)CC1CCCC1 | 1
84 | C(CCC1CCC2CCCCC2C1)CC1CCCC1 | 1
85 | C(CCC1CCCC1)CCC1CCC(CC1)C(C1CCCCC1)C1CCCCC1 | 1
86 | C(CCC1CCCC1)CCC1CCCC1 | 1
87 | C(CCC1CCCC(CC2CCCC2)C1)CC1CCCC1 | 1
88 | C(CCCC1CCCC1)CCCC1CCC(CC1)C(C1CCCCC1)C1CCCCC1 | 1
89 | C1CCC(C1)C1CCCCC1 | 1
90 | C(CCC1CCCC2CCCCC12)CC1CCCCC1 | 1
91 | C(CCCCC1CCCC1)CCCCC1CCCCC1 | 1
92 | C(CCCCC1CCCC1)CCCC(C1CCCCC1)C1CCCCC1 | 1
93 | C(CCCC1CCCC1)CCCC1CCC(CC2CCCCC2)CC1 | 1
94 | C(CC1CCC2CCCCC12)CC1CCC2CCCCC2C1 | 1
95 | C(CCC1CCC(C1)C(C1CCCCC1)C1CCCCC1)CCC1CC2CCCCC2C1 | 1
96 | C(C1CCCC1)C1CCC2CCCCC2C1 | 1
97 | C(CCC1CC2CCCCC2C1)CCC1CCC(CC1)C(C1CCCCC1)C1CCCCC1 | 1
98 | C(CCC1CCCC1CCC(C1CCCCC1)C1CCCCC1)CC1CC2CCCCC2C1 | 1
99 | C(CCC1CC2CCCCC2C1)CCC1CCC(CC2CCCCC2)CC1 | 1
100 | C1CC2CCC(CC2C1)C1CCCC(C1)C1CCCCC1 | 1
101 | C(CC1CCC(CC1)C1CCC(CC1)C1CCCC1)C1CCCCC1 | 1
102 | C(CCC1CCCC1)CC(C1CCCCC1)C1CCCCC1 | 1
103 | C(CCC1CCCC1CC1CCC(CC1)C(C1CCCCC1)C1CCCCC1)CC1CC2C1CCCC2C1 | 1
104 | C(CCC1CC2CCCCC2C1)CCC12CC3CC(CC(C3)C1)C2 | 1
105 | C(C1CCCC1)C1CCCC1CC1CC2CCCCC2C1 | 1
106 | C(CC1CCC1)CC12CC3CC(CC(C3)C1)C2 | 1
107 | C(CC1CCCC1)CC1CCCC1C(CC1CCCCC1)CC1CCCCC1 | 1
108 | C(CC1CCCCC1)C(C1CCCCC1)C1CCC(CCC2CCCCC2)CC1 | 1
109 | C(CC1CCCCC1)C(C1CCCCC1)C1CCC2CC(CCC2C1)C1CCCCC1 | 1
110 | C(CC1CCCC1)CC1CC2CCCCC2C1 | 1
111 | C(CCCC1CCCC1)CCCC1CCCCC1 | 1
112 | C(C1CCCCC1)C1CCC2CCC(CC2C1)C1CCCCC1 | 1
113 | C(CCCCCC1CCCCC1)CCCCC1CC2CCCCC2C1 | 1
114 | C(CC1CCCC1C(CC1CCCCC1)CC1CCCCC1)CC1CCCCC1 | 1
115 | C(C(CC1CCCCC1)C1CCCCC1)C1CCCC1 | 1
116 | C(CC1CCC2CCCCC2C1)CC12CC3CC(CC(C3)C1)C2 | 1
117 | C(CCC1CCCC1)CC(C(CC1CCCCC1)CC1CCCCC1)C1CCCCC1 | 1
118 | C(CC1CCCCC1)CC1CCC(CCCC2CCCCC2)C(CC2CCCCC2)C1 | 1
119 | C(CC1CCCC1)CC1CCCC1C1CCCCC1 | 1
120 | C(CCCCC1CCCCC1)CCCC1CCCCC1 | 1
121 | C(CCCC1CCCCC1CCC1CCCCC1)CCC1CCCCC1 | 1
122 | C(CCCC1CCC2CCCCC2C1)CCC1CCCCC1 | 1
123 | C1CC2CCCC2C1 | 1
124 | C(CCCC1CCCCC1)CCCC1CCCCC1 | 1
125 | C(CCCC1CCC(CCC2CCCCC2)CC1)CCC1CCCCC1 | 1
126 | C(CCCC1CCCC2CCCCC12)CCC1CCCCC1 | 1
127 | C1CCC(CC1)C(C1CCCCC1)C1CCCCC1 | 1
128 | C(C1CCCCC1)C1CCC(CC1)C1CCCCC1 | 1
129 | C(CCCCC1CCCCC1)CCCC1CC2CCCCC2C1 | 1
130 | C(CC1CCCC(CC2CC3CCCCC3CC2C2CCCCC2)C1)C1CCCCC1 | 1
131 | C(C1CCCC1)C1CC2CCCC2C1 | 1
132 | C(CC1CCC2CCCCC2C1)C1CCCC1 | 1
133 | C(CC1CC2CCCCC2C1)CC1CCC2CCCCC2C1 | 1
134 | C(CCC1CCCCC1)CC1CC2CCCCC2C1 | 1
135 | C(CC1CC2CCCC2C1)CC1CCCCC1 | 1
136 | C(CC1CCC2CCCCC2C1)C1CCC(CC2CCCC2)C1 | 1
137 | C(CCC1CCCCCC1)CC1CCCCC1 | 1
138 | C(CC1CCCCC1)C1CCC(C1C1CCCCC1)C1CCCCC1 | 1
139 | C1CCC(C1)C1CCC2C(CCC3CCCCC23)C1 | 1
140 | C(CC1CCCC1)CC1CCC(CCCC2CCCCC2)CC1 | 1
141 | C(CC1C2CC(CC12)C1CCCCC1)CC1CCCC1 | 1
142 | C(CC1CCCCC1C1CCCCC1)C1CCCC1CC1CCCC1 | 1
143 | C(C1CCCC1)C1CCCC1C1CC2CCCCC2C2CCCCC2C1 | 1
144 | C(CCCC1CC2CCCCC2C1)CCC(C1CCC2CCCCC2C1)C1CCC2CCCC1C2C1 | 1
145 | C(CCCC1CC2CCCCC2C1)CCC(CC1CCCCC1)CC1CCCCC1 | 1
146 | C(CCCC1CC2CCCCC2C1)CCCC1CCC2CCCCC2C1 | 1
147 | C(C1CCCC1)C1CC2CCCCC2C1 | 1
148 | C(CCCCC(CCCCC1CCCCC1)CC1CCCC1)CCCCC1CCCCC1 | 1
149 | C(CCCCCC1CCCC2CCCCC12)CCCCC1CCCCC1 | 1
150 | C(CCCCCC1CCCCC1)CCCCC1CCCC1 | 1
151 | C(C(CC1CCCCC1)C1CCC(CC2CC3CCCCC3C2)C1)C1CCCCC1 | 1
152 | C(CC1CCCC1)CC1CCC2CCCCC12 | 1

Abbreviation: SMILES, simplified molecular-input line-entry system.

Table S7

Summary of important structural fingerprints ranked by the MDGI

RankFingerprintStructureFingerprint occurrence
MDGI
ActivesInactives
1KRFP4541 98439.197
2KRFP2428 84563.610
3KRFP3668 1522.312
4KRFP0610 3621,6852.227
5KRFP3616 16101.813
6KRFP3405 313321.563
7KRFP0223 31341.400
8KRFP2650 501.119
9KRFP1945 911.021
10KRFP0018 7290.727
11KRFP0605 2841,1820.588
12KRFP1144 5800.587
13KRFP0566 55840.581
14KRFP0344 3019350.572
15KRFP3025 4581,8740.511
16KRFP3561 2580.407
17KRFP3713 292470.391
18KRFP0496 0490.382
19KRFP2200 200.381
20KRFP0621 2467760.341
21KRFP3152 1620.320
22KRFP3966 7130.302
23KRFP3081 0700.278
24KRFP3920 590.214
25KRFP4261 3730.185
26KRFP3369 1897890.161
27KRFP0677 2361,0260.155
28KRFP0508 110.137
29KRFP2264 851670.131
30KRFP3602 4520.123

Abbreviation: MDGI, mean decrease of Gini index.
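The MDGI ranking in Table S7 comes from a tree-based model, but the source does not give the software call used. As a hedged illustration of the underlying quantity, the sketch below computes the decrease in Gini impurity for a single split on one binary fingerprint bit; a random forest's MDGI accumulates such decreases over the splits that use a feature and averages them over the trees. All data here are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = active, 0 = inactive)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_active = sum(labels) / n
    return 1.0 - p_active ** 2 - (1.0 - p_active) ** 2

def gini_decrease(bit_values, labels):
    """Impurity decrease obtained by splitting on presence or absence of one fingerprint bit."""
    present = [y for b, y in zip(bit_values, labels) if b]
    absent = [y for b, y in zip(bit_values, labels) if not b]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted

# Hypothetical data: six compounds, two candidate fingerprint bits
labels = [1, 1, 1, 0, 0, 0]
informative_bit = [1, 1, 1, 0, 0, 0]    # separates actives from inactives perfectly
uninformative_bit = [1, 0, 1, 0, 1, 0]  # nearly unrelated to activity

decrease_informative = gini_decrease(informative_bit, labels)
decrease_uninformative = gini_decrease(uninformative_bit, labels)
print(decrease_informative, decrease_uninformative)
```

A bit that separates the classes well yields a large decrease, which is why fingerprints with larger MDGI values are read as more important.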

Table S8

Applying Lipinski’s rule of five on investigated data sets

Data sets | Total | Actives | Inactives
--- | --- | --- | ---
DPP4-TRN | 2,339/2,609 (89.651%) | 1,961/2,075 (94.506%) | 478/534 (89.513%)
DPP4-TEST1 | 222/325 (68.308%) | |
DPP4-TEST2 | 215/325 (66.154%) | |
DPP4-TEST3 | 301/325 (92.615%) | |

Note: Values are the number of compounds passing Lipinski’s rule of five relative to the total number of compounds (values in parentheses are the percentages passing Lipinski’s rule of five).
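The pass counts in Table S8 can be related to a simple rule-of-five check. The sketch below uses the commonly quoted thresholds (molecular weight at most 500, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors) and allows one violation; the exact thresholds and violation tolerance applied by the authors are not stated in the source, so both are assumptions, as are the descriptor values of the example compound.

```python
def rule_of_five_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five criteria a compound violates."""
    return sum([mw > 500, logp > 5, h_donors > 5, h_acceptors > 10])

def passes_rule_of_five(mw, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed pass criterion: no more than max_violations violated criteria."""
    return rule_of_five_violations(mw, logp, h_donors, h_acceptors) <= max_violations

def percent_passing(n_pass, n_total):
    """Percentage of compounds passing, rounded to three decimals as in Table S8."""
    return round(100.0 * n_pass / n_total, 3)

# Hypothetical compound descriptors (illustration only)
example_passes = passes_rule_of_five(mw=410.5, logp=2.1, h_donors=2, h_acceptors=6)

# Percentages recomputed from the counts reported in Table S8
test1_percent = percent_passing(222, 325)
test3_percent = percent_passing(301, 325)
print(example_passes, test1_percent, test3_percent)
```

Setting `max_violations=0` gives the stricter reading in which every criterion must be satisfied.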
