A note on utilising binary features as ligand descriptors.

Hamse Y Mussa¹, John B O Mitchell², Robert C Glen³.

Abstract

It is common in cheminformatics to represent the properties of a ligand as a string of 1's and 0's, with the intention of elucidating, inter alia, the relationship between the chemical structure of a ligand and its bioactivity. In this commentary we note that, where relevant but non-redundant features are binary, they inevitably lead to a classifier capable of capturing only a linear relationship between structural features and activity. If, instead, we were to use relevant but non-redundant real-valued features, the resulting predictive model would be capable of describing a non-linear structure-activity relationship. Hence, we suggest that real-valued features, where available, are to be preferred in this scenario.

Keywords: Bernoulli distribution; Binary descriptors; Ligand chemical structure; Linear relationship

Year: 2015; PMID: 26628925; PMCID: PMC4665894; DOI: 10.1186/s13321-015-0105-3

Source DB: PubMed; Journal: J Cheminform; ISSN: 1758-2946; Impact factor: 5.514

Background

One of the major goals of cheminformatics is to predict the relationship between a ligand’s chemical structure and its bioactivity [1]. If this relationship is captured correctly, then (among other goals) designing the right drug for each disease would become an easier task [1, 2]. Unfortunately, the structure-activity relationship can often be intricate and arcane, and in particular non-linear. To devise an adequate model describing this relationship, the cheminformaticist typically follows a standard approach: one starts with a large number of ligand attributes or features considered important for representing the underlying characteristics of the ligand and relevant to its bioactivity. Then, through feature selection techniques, one selects the ligand attributes deemed to have statistically minimum interdependence among themselves (given the ligand bioactivity), while also showing strong association with the ligand bioactivity [3-5]. With this step, one strives for a set of relevant but non-redundant ligand features [4, 5]: “relevant” in the sense that there is a strong association between the selected features and the bioactivity, and “non-redundant” in the sense that these features are conditionally independent given the bioactivity. (Irrelevant features are basically noise, and relevant but redundant features are a nuisance [6]; we are not concerned with either kind here.)

Typically the ligand’s chemical structure is represented by an L-dimensional vector $\mathbf{x} = (x_1, x_2, \ldots, x_L)$. The elements $x_l$ ideally contain appropriate information about the ligand’s features, relevant for predicting its bioactivity. This bioactivity against a particular target or protein may be represented either numerically or as a class label; such classes (or class labels) are denoted henceforth by k, where k = 1, 2, ..., K with K being the total number of classes of interest. Identifying the relevant features $\mathbf{x}$ without errors is generally impossible.

Usually both $\mathbf{x}$ and k are treated as random variables such that for a given $\mathbf{x}$ we have a distribution $p(k \mid \mathbf{x})$, the so-called class posterior probability, on the different possible classes [1, 7]. In practice, a $p(k \mid \mathbf{x})$ that can assign a new ligand represented by $\mathbf{x}$ to the class minimising the probability of misclassification is induced from given prototype samples (a training dataset) [8, 9]. In Bayesian probabilistic settings, it is usually computationally easier to estimate $p(k \mid \mathbf{x})$ in terms of the class probability ($p(k)$), the evidence ($p(\mathbf{x})$) and the class-conditional probability density function ($p(\mathbf{x} \mid k)$):

$$p(k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid k)\, p(k)}{p(\mathbf{x})} \tag{1}$$

In cheminformatics, the main task of estimating $p(k \mid \mathbf{x})$ often reduces to inducing $p(\mathbf{x} \mid k)$ from the training dataset.
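As a small illustration of how the class posterior follows from the class probability, the evidence and the class-conditional density via Bayes' rule, the sketch below computes it for a two-class problem. The function name `class_posterior` and all numerical values are hypothetical and chosen only for illustration; they do not come from any dataset discussed here.

```python
def class_posterior(priors, likelihoods):
    """Return p(k|x) for every class k via Bayes' rule.

    priors[k]      -- p(k), the class probability
    likelihoods[k] -- p(x|k), the class-conditional density evaluated at x
    """
    joint = [p_k * p_x_k for p_k, p_x_k in zip(priors, likelihoods)]
    evidence = sum(joint)  # p(x) = sum over k of p(x|k) p(k)
    return [j / evidence for j in joint]


# Hypothetical numbers for two classes (index 0: inactive, index 1: active).
priors = [0.7, 0.3]
likelihoods = [0.02, 0.10]

posterior = class_posterior(priors, likelihoods)

# Assign the ligand to the class with the largest posterior, which is the
# choice that minimises the probability of misclassification.
predicted_class = max(range(len(posterior)), key=posterior.__getitem__)
```

Because the evidence is the same for every class, only the product of class probability and class-conditional density matters for the assignment, which is why the estimation effort concentrates on the class-conditional density.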

Commentary

It is common practice nowadays to assume that the L relevant chemical structure features of the ligand can be encoded as a binary “vector” of 1’s and 0’s denoting presence (1) and absence (0) of these features, i.e., $\mathbf{x} \in \{0,1\}^L$ [10]. In practice, state-of-the-art feature selection techniques [3, 5] that are based on information theory are used to quantify the level of association between the features and the bioactivity. These techniques are also capable of quantifying the class-conditional interdependency among the features. However, in the light of the insightful work of Li on the peculiar but useful characteristics of the conditional dependence between two binary random variables [11], one might be able to go one step further: identify, among the L relevant features, those whose relationship with the bioactivity is statistically significant but whose class-conditional interdependency is statistically insignificant, i.e., retain features that are statistically non-redundant (and discard statistically redundant features).

In our probabilistic setting, relevant descriptors being non-redundant entails that $p(\mathbf{x} \mid k)$ can be expressed as a product of class-conditional univariate probability density functions $p(x_l \mid k)$, i.e., $p(\mathbf{x} \mid k) = \prod_{l=1}^{L} p(x_l \mid k)$. This means that $p(k \mid \mathbf{x})$, which is what we are interested in estimating, can be given as

$$p(k \mid \mathbf{x}) = \frac{p(k) \prod_{l=1}^{L} p(x_l \mid k)}{p(\mathbf{x})} \tag{2}$$

Since $x_l \in \{0,1\}$, the univariate distributions $p(x_l \mid k)$ are Bernoulli [8, 12, 13], i.e. $p(x_l \mid k) = p_{lk}^{x_l} (1 - p_{lk})^{1 - x_l}$, where $p_{lk} = p(x_l = 1 \mid k)$. In terms of these Bernoulli distributions, Eq. 2 modifies to

$$p(k \mid \mathbf{x}) = \frac{p(k) \prod_{l=1}^{L} p_{lk}^{x_l} (1 - p_{lk})^{1 - x_l}}{p(\mathbf{x})} \tag{3}$$

which, on taking the logarithm and dropping the class-independent term $-\ln p(\mathbf{x})$, can be further rewritten in an equivalent but more convenient form (see Chapter 4 of ref [8]):

$$g_k(\mathbf{x}) = \sum_{l=1}^{L} w^{k}_{l} x_l + w^{k}_{0} \tag{4}$$

where $w^{k}_{l} = \ln \frac{p_{lk}}{1 - p_{lk}}$; $w^{k}_{0} = \sum_{l=1}^{L} \ln (1 - p_{lk}) + \ln p(k)$. Clearly, the discriminant function $g_k(\mathbf{x})$ is linear in $\mathbf{x}$ [8, 12, 13], irrespective of the nature of the association between the chemical structure of the ligand and its bioactivity. This is the consequence of the ligand’s relevant but non-redundant features being represented by a binary “vector”.

However, the situation can be different if non-redundant real-valued features are utilised to represent the chemical structure of the ligand. In this scenario the class-conditional univariate distributions $p(x_l \mid k)$ are not necessarily Bernoulli. Here $p(x_l \mid k)$ can be expressed in Hermite polynomial basis functions $H_m(x_l)$ in the variable $x_l$:

$$p(x_l \mid k) = \sum_{m} a^{k}_{lm} H_m(x_l) \tag{5}$$

where $a^{k}_{lm}$ are the appropriate coefficient values. Note that the k in $a^{k}_{lm}$ and $w^{k}_{0}$ is just an index (not a power). Inserting Eq. 5 into Eq. 2 and then taking the logarithm of the resultant equation (again dropping the class-independent term $-\ln p(\mathbf{x})$) yields the following discriminant function

$$g_k(\mathbf{x}) = \sum_{l=1}^{L} \ln \left[ \sum_{m} a^{k}_{lm} H_m(x_l) \right] + w^{k}_{0} \tag{6}$$

where here $w^{k}_{0} = \ln p(k)$. Clearly $g_k(\mathbf{x})$ is not necessarily linear in $\mathbf{x}$ even though the features utilised are class-conditionally independent [13]. Thus, for real-valued features, the resulting classifier is capable of representing a non-linear structure-activity relationship.
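The contrast drawn above can be checked numerically. The sketch below is only an illustration under stated assumptions: all parameter values are hypothetical, and for the real-valued case a univariate Gaussian is swapped in for each class-conditional density in place of a general Hermite expansion, simply because it is the easiest real-valued density to write down. The first part confirms that, for class-conditionally independent Bernoulli features, the log of the numerator of the posterior coincides with a weighted sum of the features plus a constant. The second part shows that the analogous quantity for Gaussian features has a non-zero second difference along one feature, i.e. it is curved rather than linear.

```python
import math
from itertools import product


def bernoulli_log_joint(x, p, prior):
    """ln p(k) + sum_l ln p(x_l|k) for independent Bernoulli features."""
    return math.log(prior) + sum(
        math.log(p_l if x_l == 1 else 1.0 - p_l) for x_l, p_l in zip(x, p)
    )


def linear_discriminant(x, p, prior):
    """The same quantity written as a weighted sum of features plus a constant."""
    w = [math.log(p_l / (1.0 - p_l)) for p_l in p]
    w0 = math.log(prior) + sum(math.log(1.0 - p_l) for p_l in p)
    return sum(w_l * x_l for w_l, x_l in zip(w, x)) + w0


def gaussian_log_joint(x, means, stds, prior):
    """ln p(k) + sum_l ln N(x_l; mean_l, std_l^2) for independent real features."""
    total = math.log(prior)
    for x_l, mu, sd in zip(x, means, stds):
        total += -0.5 * math.log(2.0 * math.pi * sd**2) - (x_l - mu) ** 2 / (2.0 * sd**2)
    return total


# Hypothetical Bernoulli parameters p(x_l = 1 | k) and class probability for one class.
p_k, prior_k = [0.9, 0.2, 0.6], 0.3

# Binary case: compare the two forms on all 2^3 possible fingerprints.
max_gap = max(
    abs(bernoulli_log_joint(x, p_k, prior_k) - linear_discriminant(x, p_k, prior_k))
    for x in product([0, 1], repeat=3)
)


# Real-valued case: vary the first feature only, with hypothetical Gaussian parameters.
def g(t):
    return gaussian_log_joint([t, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 1.0, 1.0], 0.3)


# A function that is linear in t has a second difference of exactly zero.
second_difference = g(0.0) - 2.0 * g(1.0) + g(2.0)
```

With these values `max_gap` is zero up to floating-point rounding, whereas `second_difference` is clearly non-zero, so the Gaussian discriminant bends along the first feature even though the features remain class-conditionally independent.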

Conclusions

In this commentary it has been noted that, when ligand features are represented by a string of binary numbers, one must end up with a linear model for describing the dependency (if any) between the chemical structure of a ligand and its bioactivity of interest—albeit in a classification setting. Such a linear model may be severely biased and limited in its predictivity. It was also pointed out that, where relevant real-valued features are used, the resulting model can be unbiased as it can adequately capture both linear and non-linear structure-activity relationships.

1. Application of the mutual information criterion for feature selection in computer-aided diagnosis.

2. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.

3. Using mutual information for selecting features in supervised neural net learning.

4. Verifying the fully "Laplacianised" posterior Naïve Bayesian approach and more.

5. A multi-label approach to target prediction taking ligand promiscuity into account.

6. The Parzen Window method: In terms of two vectors and one matrix.