
Structure-based protein-ligand interaction fingerprints for binding affinity prediction.

Debby D Wang¹, Moon-Tong Chan², Hong Yan³.

Abstract

Binding affinity prediction (BAP) using protein-ligand complex structures is crucial to computer-aided drug design, but remains a challenging problem. To achieve efficient and accurate BAP, machine-learning scoring functions (SFs) based on a wide range of descriptors have been developed. Among those descriptors, protein-ligand interaction fingerprints (IFPs) are competitive due to their simple representations, elaborate profiles of key interactions and easy collaborations with machine-learning algorithms. In this paper, we have adopted a building-block-based taxonomy to review a broad range of IFP models, and compared representative IFP-based SFs in target-specific and generic scoring tasks. Atom-pair-counts-based and substructure-based IFPs show great potential in these tasks.

Entities: Chemical

Keywords: Computer-aided drug design; Interaction fingerprint; Machine learning; Protein–ligand binding affinity; Scoring function

Year: 2021 PMID: 34900139 PMCID: PMC8637032 DOI: 10.1016/j.csbj.2021.11.018

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Owing to advances in high-resolution structure determination [1] and in computational methodologies for structural analysis and molecular design [2], [3], structure-based drug design (SBDD) has developed into a robust and promising technique for drug discovery [4], [5]. As an important component of SBDD, molecular docking has proven to be a valuable tool for identifying novel hit compounds from a large chemical library for a particular target [6]. Binding affinity prediction (BAP) for putative protein–ligand complexes is essential in this process, and is generally achieved through scoring functions (SFs) [7], [8], [9], [10], [11]. An SF can be evaluated according to its performance in multiple tasks, including docking (identifying near-native binding modes), screening (distinguishing active binders from decoys), ranking (correctly ranking the binding affinities of the ligands for a given target) and scoring (achieving a linear correlation between the predicted binding scores and experimental binding data). Different applications of SFs emphasize one or more of these tasks, such as virtual screening (docking and screening) and lead optimization (ranking and scoring). Classical SFs generally prioritize screening speed over accurate prediction of binding affinity, and thus rarely perform well in scoring and ranking tasks [12], [13], [14]. Improving scoring and ranking power is a prerequisite for the further development of SFs, but remains a challenge in SBDD. In recent years, the focus of machine learning in this field has shifted from quantitative structure–activity relationship (QSAR) studies [15] to structure-based predictive modeling [16], [17], [18]. The increasingly available structural and binding-affinity data of protein–ligand complexes, which allow the training of BAP models, have led to a surge in machine-learning SFs [19]. These SFs can handle a large volume of structural data, and have been demonstrated to outperform classical SFs in scoring tasks [19], [20].

Feature engineering is crucial to the construction of a machine-learning SF. This process translates a complex structure into a series of descriptors, and is often guided by the knowledge of biologically relevant interactions such as hydrogen bonds, hydrophobic contacts, ionic interactions (salt bridges), π-stacking and π-cation interactions [21]. Recently, interaction fingerprints (IFPs) have become a focus of research on SF descriptors, due to their simple representations and elaborate profiles of key interactions. Given a protein–ligand complex structure (Fig. 1A), IFPs are generally defined based on protein–ligand interacting atoms (Fig. 1B), and stored as 1-dimensional vectors or matrices of Booleans, integers or floating-point numbers. In earlier works, structural IFPs have only been classified roughly, such as keyed fingerprints vs. feature collections [22] and ligand descriptors vs. protein descriptors [20]. Moreover, the heterogeneity in benchmark data, pre-processing procedures, implementations and evaluation metrics makes IFP-based scoring studies hard to compare with one another. Accordingly, in this paper we have reviewed a broad range of IFP models according to a building-block-based taxonomy, and compared the SFs incorporating representative IFP models in several scoring tasks (target-specific or generic). SFs based on other descriptors, and IFP applications in tasks other than scoring (such as screening), are outside the scope of this review and are covered in previous reviews [23], [24].

Fig. 1

Example of a protein–ligand complex and the interacting atoms. (A) Complex of HIV-1 protease and its inhibitor (PDB ID: 2QNQ). (B) Protein–ligand interacting atoms defined by a distance threshold ($t_{int}$).

Protein–ligand IFPs

Building blocks vary from one type of IFPs to another. They primarily include structural elements (protein residues, protein atoms and molecular substructures) and intermolecular interactions (atom pairs and interaction pairs/triplets). Accordingly, we classify a variety of IFP models based on their building blocks, and review them as follows.

IFPs based on protein residues

Structural interaction fingerprint (SIFt) was a pioneering study in representing and analyzing 3D protein–ligand binding interactions [25]. Each SIFt is a simple 1D binary string generated by identifying a common panel of binding-site residues, using seven bits to represent each residue in the panel, and concatenating the bit strings of all residues (Fig. 2A). The bit string for each residue covers different types of interactions, including (1) contact with the ligand, (2) binding with any protein main-chain atom, (3) binding with any protein side-chain atom, (4) polar interaction, (5) nonpolar interaction, (6) hydrogen bond with acceptor in protein, and (7) hydrogen bond with donor in protein. Accordingly, a SIFt over a panel of $n$ residues can be denoted as

$$\mathrm{SIFt} = [b_{1,1}, \ldots, b_{1,7}, \ldots, b_{n,1}, \ldots, b_{n,7}] \quad (1)$$

where $b_{i,j}$ is 0 or 1, and denotes whether an interaction of type $j$ exists between protein residue $i$ and the ligand.

Based on SIFt, a number of extensions have been developed. PyPLIF adopts an analogous list of protein–ligand interactions (apolar, aromatic, hydrogen bond and electrostatic) to generate IFPs [26]. In r-SIFt, each bit indicates whether a specific R group or core fragment of the ligand interacts with a specific protein residue [27]. Three additional interactions (hydrophobic, aromatic and charged) beyond those of SIFt were considered in [28]. Marcou et al. defined 11 types of protein–ligand interactions based on a list of atom flags and geometric criteria, including hydrophobic, aromatic (face-to-face), aromatic (edge-to-face), hydrogen bond (acceptor in ligand), hydrogen bond (donor in ligand), ionic bond with ligand negatively charged, ionic bond with ligand positively charged, weak hydrogen bond (acceptor in ligand), weak hydrogen bond (donor in ligand), π-cation and metal complexation [29]. By default, their IFP model only considers the first seven interactions (the most frequent ones) for each protein residue, but the remaining interactions (weak hydrogen bonds, π-cation interactions and metal complexation) can be easily incorporated.

Weighted SIFt (w-SIFt) assigns a weight to each interaction bit to capture its relative importance for binding, with the weights determined by stochastic optimization techniques or as simple averages at each bit position [30]. A w-SIFt model has been proposed in [31], which sets the weights for electrostatic interactions (positive charge, negative charge and metal-binding interactions) to 2 and the weights for the rest (hydrogen-bond donor/acceptor, stacking and hydrophobic interactions) to 1. Wojcikowski et al. filled each IFP position with continuous features describing the interactions between a protein residue and the ligand, covering van der Waals potential (OPLS2005 force field), hydrogen bonds and halogen bonds (donor-hydrogen distance and donor/acceptor counts), salt bridges, π-interactions and π-cation interactions [32].
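As a minimal sketch of the residue-based encoding described above, the following Python snippet concatenates one seven-element block per residue of a common panel. The interaction labels, the `sift_bits` helper, the optional weighting and the example residues are illustrative assumptions rather than the API of any published toolkit, and detecting the interactions from a 3D structure is deliberately left out.

```python
# Seven SIFt interaction types, in the order listed in the text.
# The label strings are hypothetical names chosen for this sketch.
INTERACTION_TYPES = (
    "contact", "main_chain", "side_chain", "polar",
    "nonpolar", "hbond_protein_acceptor", "hbond_protein_donor",
)


def sift_bits(residue_panel, interactions, weights=None):
    """Concatenate one 7-element block per residue of the panel.

    residue_panel: ordered list of residue labels shared by all complexes.
    interactions:  dict mapping residue label -> set of interaction types
                   detected for one complex (detection is not shown here).
    weights:       optional dict type -> weight, giving a w-SIFt-like
                   variant (assumed form; unlisted types keep weight 1).
    """
    fingerprint = []
    for residue in residue_panel:
        found = interactions.get(residue, set())
        for itype in INTERACTION_TYPES:
            bit = 1 if itype in found else 0
            if weights is not None:
                bit *= weights.get(itype, 1)
            fingerprint.append(bit)
    return fingerprint


# Hypothetical panel and detected interactions for one complex.
panel = ["ASP25", "GLY27", "ILE50"]
observed = {
    "ASP25": {"contact", "side_chain", "polar", "hbond_protein_acceptor"},
    "ILE50": {"contact", "main_chain", "nonpolar"},
}
fp = sift_bits(panel, observed)
print(len(fp), fp)
```

Because every complex is encoded over the same residue panel, fingerprints of different ligands bound to the same target have equal length and aligned positions, which is what makes them directly usable as machine-learning features.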

Fig. 2

Interpretations of the construction processes of representative IFPs. (A) SIFt. (B) CHIF. (C) Atom-pair counts used by RF-Score. (D) APIF. (E) SPLIF.

Aside from the aforementioned models, another type of residue-based IFPs employs molecular interaction energy components (MIECs) for each position. By combining molecular dynamics (MD) simulation techniques and the MM-GB/SA free energy decomposition approach, Sun et al. developed an MIEC-based IFP [33]. AutoDock was used in their work to produce the initial protein–ligand complex structures, which were further optimized through explicit-solvent MD simulations (three stages, total 5 picoseconds). Based on each optimized complex structure, the MM-GB/SA approach was employed to calculate the residue-ligand binding free energy and its components (MIECs, Eq. 2):

$$\Delta G_{\text{residue-ligand}} = \Delta E_{vdW} + \Delta E_{ele} + \Delta G_{GB} + \Delta G_{SA} \quad (2)$$

where $\Delta E_{vdW}$, $\Delta E_{ele}$, $\Delta G_{GB}$ and $\Delta G_{SA}$ denote van der Waals interaction, electrostatic interaction, and the polar/non-polar parts of solvation free energy, respectively. Ligand-binding residues were selected according to the ranking of the average $\Delta G_{\text{residue-ligand}}$, and different MIECs of these residues constitute the IFPs (Eq. 3).

Later, the same group proposed a similar IFP model, which combines Glide docking, implicit-solvent MD simulations, the MM-GB(PB)/SA approach for energy decomposition and threshold-based identification of ligand-binding residues [34]. The Protein–ligand Empirical Interaction Components (PLEIC) method [35] identifies a panel of residues according to three types of interactions with the ligand (van der Waals interaction, hydrophobic contact and hydrogen bond), and calculates the corresponding MIECs empirically from pairwise interatomic distances. In these calculations, the three types of interaction energies are computed between residue $i$ and the ligand from the distance between atom $j$ in the ligand and atom $k$ in residue $i$ and from the sum of the atomic radii of $j$ and $k$, and only hydrogen-bond donors and acceptors in the ligand and residue $i$ are considered when calculating the hydrogen-bond term.

Similarly, in the work of Yasuo et al. [36], Glide was adopted for docking each ligand to the target and estimating the MIECs between the ligand and protein residues within a cutoff distance of the ligand center, where van der Waals interactions, electrostatic interactions and hydrogen bonds were considered as MIECs. Ji et al. adopted different protocols of MD simulations for pre-processing each Glide-docked complex structure, and constructed IFPs (Eq. 3) based on MM-GB/SA free-energy-decomposition calculations for the key residues [37].

IFPs based on protein atoms

Analogous to SIFt, knowledge-based IFP (KB-IFP) was proposed as a similarity-search tool for reference-based scoring [38]. It is a bit string as long as the number of binding-site heavy atoms, and is generated based on the interactions detected by pairwise interatomic parameters (distance $d$ and hydrogen-bond angle $a$). Hydrogen bonds (identified by thresholds on $d$ and $a$) and close contacts ($d$ within the sum of vdW radii) are identified first, and a bit in KB-IFP is set to 1 if the corresponding binding-site heavy atom forms an interaction with an atom in the ligand, and to 0 otherwise (Eq. 5). These IFPs are named HIF (hydrogen bond-based), CIF (close contact-based) or CHIF (considering both types of interactions, Fig. 2B). A similar atom-based IFP model considers the properties of hydrogen bonds within a binding site (strength, accessibility of hydrogen-bond groups and geometric arrangement) [39].

In addition, some atom-based IFPs consider the energy terms related to each binding-site atom. The protein per atom score contribution derived interaction fingerprint (PADIF) for protein–ligand complexes [40] characterizes the protein binding-site atoms using the per-atom contributions of the GOLD Score, which involves a hydrogen-bonding term and multiple linear potentials to model van der Waals and repulsive terms [41].
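The close-contact variant (CIF) can be illustrated with a short Python sketch: one bit per binding-site heavy atom, set when any ligand atom lies within the sum of the two van der Waals radii. The function name `cif_bits`, the input layout, the use of Bondi radii and the inclusive comparison are assumptions made for illustration; hydrogen-bond detection (needed for HIF and CHIF) would additionally require the angle criterion and is omitted.

```python
import math

# Bondi van der Waals radii in angstroms for a few common elements.
# Using this particular radius set is an assumption of the sketch.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def cif_bits(site_atoms, ligand_atoms):
    """Close-contact fingerprint over binding-site heavy atoms.

    site_atoms / ligand_atoms: lists of (element, (x, y, z)) tuples.
    Returns one bit per binding-site atom, in input order.
    """
    bits = []
    for elem_p, xyz_p in site_atoms:
        in_contact = any(
            math.dist(xyz_p, xyz_l) <= VDW_RADII[elem_p] + VDW_RADII[elem_l]
            for elem_l, xyz_l in ligand_atoms
        )
        bits.append(1 if in_contact else 0)
    return bits


# Toy coordinates (hypothetical): the O atom is 3.0 A from the ligand N,
# which is below 1.52 + 1.55 = 3.07 A; the C atom is 7.0 A away.
site = [("O", (0.0, 0.0, 0.0)), ("C", (10.0, 0.0, 0.0))]
ligand = [("N", (3.0, 0.0, 0.0))]
print(cif_bits(site, ligand))
```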

IFPs based on atom-pair counts

Atom-pair counts (intermolecular interaction features) were used to construct RF-Score, which is arguably the first machine-learning SF [42]. Among protein–ligand atom pairs that interact within a certain distance range (12 Å, Table 4), the counts of specific pairs, classified by the atom types, are the main SF descriptors. Specifically, the atom types {C, N, O, S} for protein atoms and {C, N, O, F, P, S, Cl, Br, I} for ligand atoms were considered, constituting 4 × 9 = 36 descriptors. This list of descriptors can also be regarded as an integer IFP (Fig. 2C) and expressed as

$$\mathrm{IFP} = [c_1, c_2, \ldots, c_{36}] \quad (6)$$

where $c_i$ represents the number of interactions of type $i$, i.e. of the $i$-th combination of a protein atom type and a ligand atom type. These simple descriptors can be easily calculated and often lead to fast scoring. CScore was proposed later [43] to further subdivide the interactions into repulsive and attractive types by introducing two fuzzy membership functions. The atom type H was additionally considered in this work, yielding an integer IFP with separate entries for the repulsive and attractive interactions of each pair type.

OnionNet is a recent SF that employs atom-pair counts as descriptors and uses deep-learning models for affinity prediction [44]. This method starts from each ligand atom, and defines a series of $N$ distance bins (shells) around the atom. Considering eight element types (C, N, O, H, P, S, HAX: halogen atoms, DU: remaining types) for protein and ligand atoms, atom-pair counts within the $N$ distance bins form a 2D integer IFP as

$$\mathrm{IFP} = [c_{i,n}], \quad i = 1, \ldots, 64, \; n = 1, \ldots, N \quad (8)$$

where $c_{i,n}$ counts the atom pairs of type $i$ that fall within distance bin $n$. Such IFPs (OnionNet setting: $N = 60$) were extracted and fed into deep convolutional neural networks (CNNs) for affinity prediction. Recently, the same group has proposed the OnionNet-2 model, which replaces the original atom pairs with pairs of a protein residue and a ligand atom [45]. Specifically, 21 residue types (20 standard types and an expanded type) were considered, and the residue-atom distance was calculated as the distance between the ligand atom and the nearest heavy atom in the residue.

As another extension of RF-Score, the extended connectivity interaction features (ECIF) model employs six atomic properties (atom type, explicit valence, connections to heavy atoms, connections to hydrogens, aromaticity and ring membership) to define the types of atom pairs [46]. For example, one such atom type denotes an oxygen atom with an explicit valence of 2, connected to 1 heavy atom, and having neither aromaticity nor ring membership. Accordingly, a combination of one protein atom type and one ligand atom type represents a specific type of atom pair. These descriptors lead to an integer IFP with a length of 1540. In this GBDT-based study, a series of distance cutoffs at a fixed interval were used to generate ECIFs, and the cutoff of 6 Å (Table 4) was reported to result in the best predictions.
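A minimal Python sketch of RF-Score-style atom-pair counts is given below. The element lists (four protein types, nine ligand types, hence 36 features) are those commonly reported for RF-Score and the 12 Å cutoff follows Table 4; the function name, the input layout and the strict `<` comparison at the cutoff are assumptions of this sketch rather than the original implementation.

```python
import math
from itertools import product

# Element types assumed for this sketch (4 x 9 = 36 pair types).
PROTEIN_TYPES = ("C", "N", "O", "S")
LIGAND_TYPES = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")
PAIR_TYPES = list(product(PROTEIN_TYPES, LIGAND_TYPES))


def atom_pair_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs per (protein type, ligand type).

    protein_atoms / ligand_atoms: lists of (element, (x, y, z)) tuples.
    Atoms whose element is not in the type lists are ignored.
    Returns a list of 36 integers ordered like PAIR_TYPES.
    """
    counts = dict.fromkeys(PAIR_TYPES, 0)
    for (elem_p, xyz_p), (elem_l, xyz_l) in product(protein_atoms, ligand_atoms):
        key = (elem_p, elem_l)
        if key in counts and math.dist(xyz_p, xyz_l) < cutoff:
            counts[key] += 1
    return [counts[pair] for pair in PAIR_TYPES]


# Toy complex with hypothetical coordinates (angstroms).
protein = [("C", (0.0, 0.0, 0.0)), ("N", (0.0, 0.0, 5.0)), ("H", (0.0, 0.0, 1.0))]
ligand = [("O", (0.0, 0.0, 3.0)), ("C", (0.0, 0.0, 20.0))]
features = atom_pair_counts(protein, ligand)
print(len(features), sum(features))
```

A distance-binned variant in the spirit of OnionNet would keep one such count vector per distance shell instead of a single vector per complex.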

IFPs based on pairs or triplets of interactions

Atom-pair-based interaction fingerprints (APIF) consider interactions between pairs of protein atoms and pairs of ligand atoms [47]. An APIF is generated in four steps: identification of the active site (by a distance threshold), detection of protein–ligand interactions (hydrogen bonds and hydrophobic contacts), classification of pairwise interactions and construction of the final IFP (Fig. 2D). Six types of pairwise protein–ligand interactions can be detected [47]. For each pairwise interaction, the distance between the two protein atoms ($d_P$) and that between the two ligand atoms ($d_L$) are each mapped into 7 distance bins. The final APIF is a string of 6 × 7 × 7 = 294 bits as expressed in Eq. 9,

$$\mathrm{APIF} = [b_{i,m,n}], \quad i = 1, \ldots, 6, \; m, n = 1, \ldots, 7 \quad (9)$$

where $b_{i,m,n}$ indicates the occurrence of type-$i$ pairwise interactions with $d_P$ in the $m$-th protein distance bin and $d_L$ in the $n$-th ligand distance bin.

Pharmacophore-based interaction fingerprints (Pharm-IF) are similar to APIF, but characterize each pairwise interaction using the distance and pharmacophore features of the two ligand atoms [48]. To generate Pharm-IF, protein–ligand interactions (hydrogen bond, ionic and hydrophobic) are identified. Pairwise interactions are formed by combining these individual interactions. The Pharm-IF of a protein–ligand complex is then constructed as a matrix indexed by the type of pairwise interaction $tp$ and by the distance bins defined for the two ligand atoms; each entry is computed from the pairwise interactions $p$ of type $tp$ through a function of the distance between the two ligand atoms in $p$ and the boundaries of the corresponding bin.

To construct triplet-based IFPs (TIFPs) [20], interactions are identified based on specific pharmacophoric properties (hydrophobic, aromatic, hydrogen-bond donor/acceptor, ionic and metal) and geometric criteria, with each interaction represented by an interaction pseudoatom (IPA). Triplets of IPAs are then characterized by the pharmacophoric properties of the IPAs and their related distances (binned into discrete intervals). These triplets can be pruned into 210 integers (a TIFP string) by removing redundancy and validating the geometry, with each integer registering the count of IPA triplets occurring at the binned distances.
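The APIF layout (six pairwise-interaction types, each with seven protein-distance bins and seven ligand-distance bins) can be sketched in Python as follows. The bin edges, the integer type indices and the flattening order are hypothetical choices made for illustration; only the 6 × 7 × 7 shape is taken from the description above.

```python
from bisect import bisect_right

N_TYPES, N_BINS = 6, 7
# Hypothetical bin edges in angstroms: 6 edges split distances into 7 bins.
BIN_EDGES = (2.5, 4.0, 6.0, 9.0, 13.0, 18.0)


def apif_bits(pairwise_interactions):
    """Build an APIF-like bit string.

    pairwise_interactions: iterable of (type_index, d_protein, d_ligand),
    where type_index is in 0..5, d_protein is the distance between the two
    protein atoms and d_ligand the distance between the two ligand atoms.
    Returns a flat list of 6 * 7 * 7 = 294 bits.
    """
    bits = [0] * (N_TYPES * N_BINS * N_BINS)
    for type_index, d_protein, d_ligand in pairwise_interactions:
        m = bisect_right(BIN_EDGES, d_protein)  # protein-distance bin, 0..6
        n = bisect_right(BIN_EDGES, d_ligand)   # ligand-distance bin, 0..6
        bits[(type_index * N_BINS + m) * N_BINS + n] = 1
    return bits


# Two hypothetical pairwise interactions of types 0 and 5.
example = apif_bits([(0, 3.0, 5.0), (5, 20.0, 1.0)])
print(len(example), [i for i, b in enumerate(example) if b])
```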

IFPs based on molecular substructures

Molecular substructures or fragments are frequently used to cluster compounds and for ligand-based virtual screening [49], [50]. Structural fingerprints like the extended connectivity fingerprints (ECFPs) [51] are representative descriptors of this kind. The ECFP of a molecule is a folded string of integer identifiers, which are iteratively assigned to circular atom environments (substructures) up to a pre-defined bond radius ($R$). Initially, each heavy atom (radius $r = 0$) is assigned an identifier based on a group of properties (e.g. mass, charge or connections), and such identifiers are iteratively updated to cover larger atom environments ($r = 1, \ldots, R$). For a heavy atom $a$ in a molecule, suppose $a_1, \ldots, a_k$ are its neighboring atoms, $b_1, \ldots, b_k$ are the bonds connecting $a$ and its neighbors, and $id_{r-1}(x)$ is the identifier of the environment centered at $x$ and with a radius of $r-1$. Then the identifier of the atom environment with center $a$ and radius $r$ is derived by combining $id_{r-1}(a)$ with the pairs $(b_1, id_{r-1}(a_1)), \ldots, (b_k, id_{r-1}(a_k))$ through a hash procedure.

Most substructure-based IFPs are based on ECFPs, and can be classified as ligand-centric or interaction-centric. As an early ligand-centric, substructure-based IFP, the Interaction Annotated Structural Features (IASF) method [22] calculates atom-based energy scores based on the FlexX scoring function, which considers neutral hydrogen bonds, salt bridges, aromatic interactions, lipophilic contacts and van der Waals contacts [52]. The substructures (circular atom environments) at the binding-site region are generated by the ECFP4 ($R = 2$) algorithm and annotated with cumulative atom-based energy scores. The scores of the substructures can be organized as an IFP with one entry per substructure (Eq. 12), or accumulated as the full score for the host-ligand interaction. Different types of molecular fragments, including interacting fragments (IFs), random fragments (RFs) and interaction-compromised fragments (ICFs), have been defined for protein–ligand complexes [53] and can be mapped to fingerprints according to MACCS structural keys [54]. Here the IFs include ligand atoms involved in hydrogen bonds, ionic interactions or van der Waals interactions with the protein [53]. As an extension of IFs, atom-centered interaction fragments (AIFs) have been used to construct IFPs for similarity searches [55]. The AIF of each atom in an IF can be formed by combining it with its neighbors within a pre-defined bond radius (2, 4 or 6 in [55]), analogous to ECFP substructures. AIFs form a library of unique descriptors, which can be used to encode a binary or integer fingerprint.

Unlike IFPs that are based on a specific list of interaction types, interaction-centric, substructure-based IFPs implicitly account for all types of local interactions and do not require pre-defined geometric criteria to identify interactions [56]. Structural protein–ligand interaction fingerprints (SPLIF), a pioneering study, explicitly encode interacting substructures of a protein–ligand complex [57]. A SPLIF can be generated by identifying interacting atoms (distance within 4.5 Å, Table 4), extracting circular substructures and folding the information of pairwise substructures (Fig. 2E). These substructures are centered on interacting atoms and detected by the ECFP2 ($R = 1$) algorithm. Each position in a SPLIF indicates the existence or occurrence of a specific pair of substructures. Protein–ligand extended connectivity fingerprints (PLEC FP) consider protein–ligand contact substructures with multiple pairs of radii or depths (e.g. a protein depth of 5 and a ligand depth of 1, Table 4), and hash these substructure pairs to specific fingerprint positions [58]. Proteo-chemometric IFPs (PrtCmm IFPs) [59] can be generated by separately encoding interacting substructures (produced by the ECFP algorithm) of the protein and ligand, and concatenating the two fingerprints.
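The two ideas underlying these fingerprints, iterative hashing of circular atom environments and folding of substructure pairs into a fixed-length vector, can be sketched in Python as below. This is a simplified illustration under stated assumptions: `stable_hash`, `environment_ids` and `fold_pairs` are hypothetical helpers, the hash is a truncated SHA-256 rather than the hash used by any cheminformatics toolkit, and the duplicate-substructure removal of the full ECFP algorithm is omitted.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic 32-bit hash of arbitrary printable Python values."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def environment_ids(atoms, bonds, radius):
    """ECFP-like identifiers of circular atom environments.

    atoms: dict atom index -> tuple of initial atomic properties.
    bonds: dict (i, j) -> bond order.
    Returns dict atom index -> identifier of the environment of the given
    radius centered on that atom.
    """
    neighbours = {a: [] for a in atoms}
    for (i, j), order in bonds.items():
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))
    ids = {a: stable_hash(props) for a, props in atoms.items()}
    for _ in range(radius):
        # Combine each atom's identifier with its sorted (bond, neighbour id) pairs.
        ids = {
            a: stable_hash(ids[a], sorted((o, ids[n]) for o, n in neighbours[a]))
            for a in atoms
        }
    return ids


def fold_pairs(id_pairs, length=1024):
    """Fold (ligand id, protein id) substructure pairs into an integer vector."""
    fingerprint = [0] * length
    for ligand_id, protein_id in id_pairs:
        fingerprint[stable_hash(ligand_id, protein_id) % length] += 1
    return fingerprint


# Toy three-atom chain C-C-O with single bonds (hypothetical input).
atoms = {0: ("C",), 1: ("C",), 2: ("O",)}
bonds = {(0, 1): 1, (1, 2): 1}
r0 = environment_ids(atoms, bonds, 0)
r1 = environment_ids(atoms, bonds, 1)
print(r0[0] == r0[1], r1[0] == r1[1])
```

At radius 0 the two carbon atoms share an identifier, whereas at radius 1 their environments differ because only one of them is bonded to oxygen. Folding by a modulo makes the fingerprint length a tunable parameter, at the cost of possible collisions between different substructure pairs.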

Comparison of IFP Scores

Machine-learning SFs that use IFPs as descriptors, IFP Scores for short, play an important role in BAP problems. A scoring task in BAP can be either target-specific (multiple ligands for the same target) or generic (multiple targets). An overview of the aforementioned IFPs and the related IFP Scores is presented in Table 1. Here the IFPs for RF-Score, CScore, OnionNet and OnionNet-2 are named atom-pair counts (APC), evolved atom-pair counts (EAPC), atom-pair counts in distance bins (APCiDB) and residue-atom-pair counts in distance bins (RAPCiDB), respectively. The reported performances of the IFP Scores, mostly on well-acknowledged benchmarks (PDBbind Core Sets), are also listed. Unfortunately, these scoring studies are hardly comparable, because they differ in training/benchmark data (different databases or different versions of the same database), data pre-processing procedures, machine-learning models, implementation details and evaluation metrics. Meanwhile, only a few toolkits have been released that support the use of IFPs (Table 2), the majority of which are extensions of SIFt. This makes the comparison among IFPs in BAP more difficult.

In this work, we used Python to re-implement representative IFP models and attempted to provide a comparison among the related SFs under a uniform setting. Specifically, we use the raw data (no further processing) in PDBbind v2019 [60] for training and validating SFs. The PDBbind database, which covers the structural and affinity data of a variety of protein–ligand complexes (around 18,000), has been extensively used in BAP studies. The developers have further filtered these complexes by multistep quality control (Refined Set and Core Set), which kept the ligands of interest and maximized the diversity of ligands for each target. Several classic machine-learning models (RFs, GBDTs and regression trees) were adopted in SF construction, and Pearson’s correlation coefficient (R) and root-mean-square error (RMSE) between the predicted and experimental binding affinities were used as the evaluation metrics.
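For reference, the two evaluation metrics can be computed as in the following Python sketch, which uses the standard definitions of Pearson's correlation coefficient and root-mean-square error. The function names and the toy affinity values are illustrative and not taken from the benchmark.

```python
import math


def pearson_r(predicted, experimental):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    return cov / math.sqrt(var_p * var_e)


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental values."""
    n = len(predicted)
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, experimental)) / n)


# Toy affinities in -log Kd/Ki units (hypothetical values).
pred = [5.0, 6.0, 7.0]
expt = [6.0, 7.0, 8.0]
print(pearson_r(pred, expt), rmse(pred, expt))
```

The toy example shows why both metrics are reported: predictions that are perfectly correlated with the experimental values can still carry a constant offset, which only the RMSE reveals.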

Table 1

An overview of different IFP models and the related scoring tasks.

Category	IFP	Ref	Format			Target-specific scoring			Generic scoring (evaluated on PDBbind Core Set)
			Binary	Integer	Floating number	Target	Machine-learning algorithm ^a	Evaluation ^b	Version	Machine-learning algorithm ^a	Evaluation ^b
Residue-based	SIFt, r-SIFt	[25], [29], [28], [26], [27]	√			–	–	–	–	–	–
	w-SIFt	[30], [31]			√	p38α [30]	–	R=0.604 [30]	–	–	–
	Continuous IFP	[32]			√	HIV-1 protease	GBDT	R=0.77±0.007 / RMSE=1.48±0.02	–	–	–
	MIEC-IFP	[33], [34], [35], [36], [37]			√	5HT2AR, CB1, M1R, VEGFR2, ERK2, A2AR [37]	multiple	mean R²=0.53 / mean RMSE=1.40 [37]	–	–	–
Atom-based	KB-IFP	[38]	√			–	–	–	–	–	–
	PADIF	[40]			√	–	–	–	–	–	–
Atom-pair-counts-based ^c	APC (RF-Score)	[42]		√		–	–	–	v2007	RF	R=0.776 / SD=1.58
	EAPC (CScore)	[43]		√		subset of PDBbind (v2009)	NN	R=0.8237 / RMSE=1.0872	v2009	NN	R=0.7768 / RMSE=1.4540
	APCiDB (OnionNet)	[44]		√		–	–	–	v2016; v2013	CNN	R=0.816 / RMSE=1.278 (v2016); R=0.78 / SD=1.45 (v2013)
	RAPCiDB (OnionNet-2)	[45]		√		–	–	–	v2016; v2013	CNN	R=0.864 / RMSE=1.164 (v2016); R=0.821 / RMSE=1.357 (v2013)
	ECIF	[46]		√		–	–	–	v2016	GBDT	R=0.857 / RMSE=1.193
Multi-interaction-based	APIF	[47]		√		–	–	–	–	–	–
	Pharm-IF	[48]		√		–	–	–	–	–	–
	TIFP	[20]			√	–	–	–	–	–	–
Substructure-based	IASF	[22]			√	–	–	–	–	–	–
	SPLIF	[57]	√	√		–	–	–	–	–	–
	PLEC FP	[58]	√	√		–	–	–	v2016; v2013	NN	R=0.82 (v2016); R=0.77 (v2013)
	PrtCmm IFP	[59]	√	√		–	–	–	v2019	RF	R=0.793 / RMSE=2.014

^a GBDT: gradient boosting decision tree, RF: random forest, NN: neural network, CNN: convolutional neural network.

^b R: Pearson’s correlation between predicted and experimental affinities, RMSE: root-mean-square error, SD: standard deviation.

^c APC: atom-pair counts, EAPC: evolved APC, APCiDB: atom-pair counts in distance bins, RAPCiDB: residue-atom-pair counts in distance bins.

Table 2

Several toolkits offering the use of IFPs.

Toolkit	Online address	IFP type	Ref	BAP works
IChem	http://bioinfo-pharma.u-strasbg.fr/labwebsite/download.html	Residue-based	[61]	-
OEChem	https://www.eyesopen.com/oechem-tk	Residue-based	[29]	-
PyPLIF	http://code.google.com/p/pyplif	Residue-based	[26]	-
MOE	https://www.chemcomp.com/Products.htm	Residue-based	[62]	-
ODDT	https://github.com/oddt/oddt	Substructure-based (PLEC FP)	[63]	[58]


Target-specific scoring

Several frequently studied targets (Table 3) were selected from PDBbind for target-specific scoring (details of the data sets are given in the Supplementary file). IFP Scores were constructed by combining IFP models with machine-learning algorithms.

Table 3

Datasets in PDBbind database for scoring tasks.

Task	Target protein	Number of protein–ligand complexes	Affinity range (-logKd/Ki)
Target-specific scoring	HIV-1 protease	301	[3.9,12.7]
	BETA-SECRETASE 1	326	[2.4,10.77]
	BROMODOMAIN-CONTAINING PROTEIN 4	176	[2.22,9.15]
Generic scoring	Multiple (refined set)	4852	[2.0,11.92]
Generic scoring	Multiple (Core Set)	285	[2.07,11.82]

Representative IFPs from the five classes (Section 2) were investigated. Each type of IFP was generated for the protein–ligand complexes in each target-specific scoring task. To generate IFPs, interacting atoms are first detected by a distance threshold ($t_{int}$); different thresholds were used in the original IFP works (Table 4). Then the pharmacophoric properties of these interacting atoms and specific geometric criteria were used to identify key interactions for IFP construction. Residue-based and atom-based IFPs were generated as binary strings, and the others as integer vectors.

Table 4

IFPs for constructing target-specific scoring functions.

IFP	t_int (Å)	Key interactions	Pharmacophoric properties & geometric criteria
SIFt1	4.5	Contact, main-chain atom, side-chain atom, polar, nonpolar, hydrogen-bond donor/acceptor	[20], [25]
SIFt2	4.5	Hydrogen-bond donor/acceptor, hydrophobic, polar, nonpolar, aromatic (face-to-face), aromatic (edge-to-face), metal-acceptor	[20], [25]
HIF	10	Hydrogen bonds	[20], [38]
CIF	10	Close contacts	[20], [38]
CHIF	10	Hydrogen bonds and close contacts	[20], [38]
APC	12	Atom-pair counts	[42]
APCiDB	30.5	Atom-pair counts in distance bins	[44]
ECIF	6	Extended connectivity interaction features	[46]
APIF	10	Pairwise interactions (hydrophobic, hydrogen-bond donor/acceptor)	[20], [47]
SPLIF	4.5	Implicitly encodes all possible local interactions (R_protein = R_ligand = 1)	[57]
PLEC FP	4.5	Implicitly encodes all possible local interactions (R_protein = 5, R_ligand = 1)	[58]
PrtCmm IFP	4.5	Implicitly encodes all possible local interactions (R_protein = R_ligand = 1)	[59]

The above IFPs were fed into three classic machine-learning algorithms (RFs [64], GBDTs [65] and regression trees [66]) to build IFP Scores. For each target, the corresponding dataset was randomly partitioned into training, test and validation sets (70%:15%:15%). Training and test sets were used for model training and parameter tuning, and the best model was further evaluated on the validation set. Because the partitions are random, this process was repeated five times and the average performance (Pearson’s correlation and RMSE) was reported. In the model-training stage, key parameters were tuned as follows. Substructure-based IFPs: the length of the fingerprint; RFs: the number of tree members from 5 to 100 at a step of 5; GBDTs: the boosting stages from 5 to 100 at a step of 5; regression trees: the maximum depth d from 2 to 50 at a step of 2. These parameter ranges keep the models from becoming too complex for the small target-specific datasets.

The performances of different IFP Scores in the three target-specific tasks are displayed in Fig. 3. The scoring performances vary with the target protein. Better performances have been observed on the BETA-SECRETASE 1 dataset, and worse performances on the BROMODOMAIN-CONTAINING PROTEIN 4 dataset. This suggests that machine-learning modeling depends on sufficient training data, as the BETA-SECRETASE 1 dataset has the most protein–ligand complexes and the BROMODOMAIN-CONTAINING PROTEIN 4 dataset the fewest (Table 3). The performances also vary with the machine-learning method. RFs and GBDTs perform better than regression trees, which partly explains the extensive use of these two methods in earlier scoring studies [32], [37], [42], [46], [59].
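The partitioning and tuning protocol described above can be sketched in Python as follows. The function name, the rounding of the subset sizes, the seeds and the dictionary keys are assumptions of this sketch; model fitting itself is omitted, and only the 70%:15%:15% proportions, the five repetitions and the parameter ranges come from the text.

```python
import random


def split_70_15_15(n_samples, seed):
    """Randomly partition sample indices into training, test and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(0.70 * n_samples)
    n_test = round(0.15 * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


# Parameter grids stated in the text (key names are hypothetical).
PARAM_GRIDS = {
    "rf_n_trees": list(range(5, 101, 5)),       # 5, 10, ..., 100
    "gbdt_n_stages": list(range(5, 101, 5)),    # 5, 10, ..., 100
    "tree_max_depth": list(range(2, 51, 2)),    # 2, 4, ..., 50
}

# Five random partitions of a 301-complex dataset (the HIV-1 protease set size
# in Table 3); reported metrics would be averaged over these repetitions.
splits = [split_70_15_15(301, seed) for seed in range(5)]
for train, test, validation in splits:
    print(len(train), len(test), len(validation))
```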

Fig. 3

Performances of IFP Scores in three target-specific tasks (targets: HIV-1 protease, BETA-SECRETASE 1, BROMODOMAIN-CONTAINING PROTEIN 4). These IFP Scores were constructed by combining an IFP model (SIFt1, SIFt2, CIF, HIF, CHIF, APC, APCiDB, ECIF, APIF, SPLIF, PLEC FP or PrtCmm IFP) with a machine-learning method (RFs, GBDTs or regression trees). Performances are evaluated using Pearson’s correlation (upper panels) and RMSE (lower panels) between the predicted and experimental affinities.

As shown in Fig. 3, given a target protein and a machine-learning method, the performances of IFP Scores largely depend on the IFP models. Generally, IFP Scores originating from atom-pair-counts-based and substructure-based IFPs perform better in these tasks. By averaging the results for the three tasks, we derived Fig. 4. Here the atom-pair-counts-based and substructure-based IFP Scores perform the best, followed by the APIF-based, SIFt-based and atom-based IFP Scores. Compared via Pearson’s correlations within each type of machine-learning model, the PLEC FP-RF Score, ECIF-GBDT Score and APC-TREE Score perform the best.

Fig. 4

Average performances of IFP Scores in three target-specific tasks (targets: HIV-1 protease, BETA-SECRETASE 1, BROMODOMAIN-CONTAINING PROTEIN 4). Pearson’s correlations and RMSEs between the predicted and experimental affinities are presented in the left and right panels, respectively.

Generic scoring

Residue-based and atom-based IFPs are not applicable to generic scoring, since their positions are tied to the residues or atoms of one particular binding site. Representative IFPs from the other three classes were adopted in this task, and these IFPs were generated as integer vectors. The PDBbind Refined Set was partitioned into training and test sets (90%:10%) for model training and parameter tuning, and the best model was evaluated on the Core Set (Table 3, details of the data sets are given in the Supplementary file). Complexes overlapping with the Core Set were removed from the Refined Set to provide a fair validation. Analogous to the target-specific scoring tasks, RFs, GBDTs and regression trees were employed to construct IFP Scores. However, because more data are available, the parameters of these models were tuned in broader ranges as follows. RFs: the number of tree members from 300 to 700 at a step of 100; GBDTs: the boosting stages from 300 to 700 at a step of 100; regression trees: the maximum depth d from 10 to 100 at a step of 5.

The results are displayed in Fig. 5. As expected, IFP Scores based on regression trees underperform those based on RFs or GBDTs. Given a machine-learning method, the IFP Scores originating from different IFP models behave similarly to those in the target-specific scoring tasks. Atom-pair-counts-based and substructure-based IFP Scores perform better than the APIF Score, and the best performances of these two classes belong to the APCiDB-RF Score and the PrtCmm-RF Score, respectively.
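The data preparation for the generic task, namely removing Core Set complexes from the Refined Set before a 90%:10% split, can be sketched in Python as below. The function name, the seed handling and the placeholder complex identifiers are assumptions for illustration; real inputs would be the PDB codes listed in the PDBbind index files.

```python
import random


def generic_split(refined_ids, core_ids, train_fraction=0.9, seed=0):
    """Drop Core Set complexes from the Refined Set, then split 90%:10%.

    Returns (training ids, test ids); the Core Set itself is kept aside
    as the external evaluation set.
    """
    core = set(core_ids)
    pool = sorted(set(refined_ids) - core)  # remove overlapping complexes
    random.Random(seed).shuffle(pool)
    n_train = round(train_fraction * len(pool))
    return pool[:n_train], pool[n_train:]


# Placeholder identifiers (hypothetical, not real PDB codes).
refined = [f"c{i:02d}" for i in range(20)]
core = ["c00", "c01", "x99"]
train_ids, test_ids = generic_split(refined, core)
print(len(train_ids), len(test_ids))
```

Removing the overlap before splitting ensures that no complex used for training or parameter tuning reappears in the evaluation set.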

Fig. 5

Performances of IFP Scores in the generic scoring task (multiple targets). The performances are evaluated on PDBbind v2019 Core Set, with Pearson’s correlations and RMSEs between the predicted and experimental affinities presented in the left and right panels.

Summary and outlook

BAP for protein–ligand complexes from their X-ray crystal structures is an important but challenging problem in computer-aided drug design. Numerous descriptors, which differ in granularity and representation, have been developed and fed into machine-learning algorithms to form BAP-oriented SFs. Recently, the extensive use of deep-learning techniques in various areas has also boosted the development of deep-learning SFs. However, these SFs improve the scoring performance only marginally, and the descriptors they use have increasingly complex representations [46]. In contrast, IFPs are simple-format descriptors that carry key protein–ligand interactions, and they can be combined with classic machine-learning algorithms to form competitive SFs for BAP. In this paper, we reviewed a wide range of IFPs following a building-block-based taxonomy, and compared representative IFP Scores in target-specific and generic scoring tasks. Validated on PDBbind v2019 datasets, atom-pair-counts-based and substructure-based IFP Scores performed well, demonstrating their potential in BAP problems. One limitation of IFP methods is that they rely on the availability of protein–ligand complex structures [21]. However, as structure-determination techniques (experimental or in silico) mature, abundant structures have been produced to support the development of IFP methods. Early IFP methods (e.g. SIFts, KB-IFPs and APIFs) use pre-defined interaction types and empirical geometric criteria to detect protein–ligand interactions, which restricts the range of interactions they can cover. Substructure-based IFPs can implicitly account for all types of local interactions, which makes them more competitive in scoring tasks. Meanwhile, the encoding step in IFP methods has evolved from an interaction-type-based form to a more manageable form, with the introduction of a hashing procedure.
Altering the basic atom properties, substructure definitions and/or encoding schemes in these models can potentially improve the scoring performance further. Atom-pair-counts-based IFPs are arguably the simplest feature representations in BAP. They can be generated easily and rapidly, and often lead to good scoring performance. However, these models rely on predefined interaction types (e.g. C–C, C–O, C–S), which may need to be adapted for different training data. In addition, refining the interaction types may make these IFPs more promising in scoring tasks. Finally, the performance of (machine-learning-based) IFP Scores depends on the types and parameters of the machine-learning models involved. Altering the model types or parameters will probably result in fluctuations in scoring performance. Tuning the parameters over a wider range of values in the model-training stage may further improve model performance.
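To illustrate why atom-pair-counts-based IFPs are simple to generate, the sketch below counts protein–ligand atom pairs of each predefined element-pair type within a distance cutoff and returns the counts as an integer vector. It is a simplified illustration rather than the APC, APCiDB or ECIF implementation: the element lists, the default cutoff and the `(element, (x, y, z))` input format are assumptions made for this example.

```python
import math

# Assumed element sets; real models define their own atom types.
PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("C", "N", "O", "S")

# Fixed ordering of pair types, so every complex maps to the same vector layout.
PAIR_TYPES = [(p, l) for p in PROTEIN_ELEMENTS for l in LIGAND_ELEMENTS]


def atom_pair_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs per element-pair type within `cutoff`.

    Each atom is a tuple (element, (x, y, z)) with coordinates in angstroms.
    The default cutoff is a hypothetical value chosen for illustration.
    """
    counts = dict.fromkeys(PAIR_TYPES, 0)
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            key = (p_elem, l_elem)
            # Pairs outside the predefined types are ignored.
            if key in counts and math.dist(p_xyz, l_xyz) <= cutoff:
                counts[key] += 1
    return [counts[key] for key in PAIR_TYPES]


if __name__ == "__main__":
    # Toy coordinates, not taken from any real complex.
    protein = [("C", (0.0, 0.0, 0.0)), ("O", (1.0, 0.0, 0.0)), ("N", (50.0, 0.0, 0.0))]
    ligand = [("C", (0.0, 0.0, 3.0)), ("S", (0.0, 4.0, 0.0))]
    print(atom_pair_counts(protein, ligand, cutoff=5.0))
```

Because the vector layout is fixed by the predefined pair types, changing the element lists changes the feature dimension, which is one way the interaction types could be adapted to different training data.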

Funding

This work was supported by the Hong Kong Innovation and Technology Commission and the Hong Kong Research Grants Council (Projects 11200818 and UGC/FDS16/M08/18).

CRediT authorship contribution statement

Debby D. Wang: Conceptualization, Methodology, Software, Writing - original draft. Moon-Tong Chan: Software, Investigation, Validation. Hong Yan: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

58 in total

Review 1. Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening.

Authors: L Xue; J Bajorath
Journal: Comb Chem High Throughput Screen Date: 2000-10 Impact factor: 1.339

2. APIF: a new interaction fingerprint based on atom pairs and its application to virtual screening.

Authors: Violeta I Pérez-Nueno; Obdulia Rabal; José I Borrell; Jordi Teixidó
Journal: J Chem Inf Model Date: 2009-05 Impact factor: 4.956

3. Atom-centered interacting fragments and similarity search applications.

Authors: José Batista; Lu Tan; Jürgen Bajorath
Journal: J Chem Inf Model Date: 2010-01 Impact factor: 4.956

4. A fast flexible docking method using an incremental construction algorithm.

Authors: M Rarey; B Kramer; T Lengauer; G Klebe
Journal: J Mol Biol Date: 1996-08-23 Impact factor: 5.469

5. Proteo-chemometrics interaction fingerprints of protein-ligand complexes predict binding affinity.

Authors: Debby D Wang; Haoran Xie; Hong Yan
Journal: Bioinformatics Date: 2021-02-26 Impact factor: 6.937

Review 6. Polypharmacology rescored: protein-ligand interaction profiles for remote binding site similarity assessment.

Authors: Sebastian Salentin; V Joachim Haupt; Simone Daminelli; Michael Schroeder
Journal: Prog Biophys Mol Biol Date: 2014-06-09 Impact factor: 3.667

Review 7. Perspectives on NMR in drug discovery: a technique comes of age.

Authors: Maurizio Pellecchia; Ivano Bertini; David Cowburn; Claudio Dalvit; Ernest Giralt; Wolfgang Jahnke; Thomas L James; Steve W Homans; Horst Kessler; Claudio Luchinat; Bernd Meyer; Hartmut Oschkinat; Jeff Peng; Harald Schwalbe; Gregg Siegal
Journal: Nat Rev Drug Discov Date: 2008-09 Impact factor: 84.694

8. Machine learning on ligand-residue interaction profiles to significantly improve binding affinity prediction.

Authors: Beihong Ji; Xibing He; Jingchen Zhai; Yuzhao Zhang; Viet Hoang Man; Junmei Wang
Journal: Brief Bioinform Date: 2021-09-02 Impact factor: 11.622

9. Structural protein-ligand interaction fingerprints (SPLIF) for structure-based virtual screening: method and benchmark study.

Authors: C Da; D Kireev
Journal: J Chem Inf Model Date: 2014-08-20 Impact factor: 4.956

10. Constructing and Validating High-Performance MIEC-SVM Models in Virtual Screening for Kinases: A Better Way for Actives Discovery.

Authors: Huiyong Sun; Peichen Pan; Sheng Tian; Lei Xu; Xiaotian Kong; Youyong Li; Tingjun Hou
Journal: Sci Rep Date: 2016-04-22 Impact factor: 4.379