
Graph-Based Feature Selection Approach for Molecular Activity Prediction.

Gonzalo Cerruela-García¹, José Manuel Cuevas-Muñoz¹, Nicolás García-Pedrajas¹.

Abstract

In the construction of QSAR models for the prediction of molecular activity, feature selection is a common task aimed at improving the results and understanding of the problem. The selection of features allows elimination of irrelevant and redundant features, reduces the effect of dimensionality problems, and improves the generalization and interpretability of the models. In many feature selection applications, such as those based on ensembles of feature selectors, it is necessary to combine different selection processes. In this work, we evaluate the application of a new feature selection approach to the prediction of molecular activity, based on the construction of an undirected graph to combine base feature selectors. The experimental results demonstrate the efficiency of the graph-based method in terms of the classification performance, reduction, and redundancy compared to the standard voting method. The graph-based method can be extended to different feature selection algorithms and applied to other cheminformatics problems.

Year: 2022 PMID: 35315648 PMCID: PMC9006223 DOI: 10.1021/acs.jcim.1c01578

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Introduction

In the construction of quantitative structure–activity relationship (QSAR) models based on classification or regression techniques, the preprocessing step is fundamental to avoid using data that yield an identical effect, no effect, or even a deceptive effect.[1] Feature selection is one of the most common tasks in this preprocessing step. Selecting an optimal set of features from which a model can achieve maximum performance is a nondeterministic polynomial (NP) problem. The objective of feature selection is to eliminate as many irrelevant and/or redundant features as possible in order to improve the performance of the prediction algorithms, reduce the negative effects of high dimensionality, accelerate the learning process, and improve the generalization and interpretability of the models.

Feature selection methods can rank individual features according to their importance (ranking methods) or evaluate complete sets of features to select an optimal subset (feature subset selection methods). This paper is concerned only with the latter. From a taxonomic point of view, feature selection methods are traditionally divided into four categories: (i) filter methods, (ii) wrapper methods, (iii) embedded methods, and (iv) hybrid methods.[2]

Filter methods select the features regardless of the algorithm used to build the model. Many filter methods have been described in the literature; among the most used are information gain,[3] gain ratio,[4] minimum redundancy maximum relevance,[5] chi-square,[4] fast correlation-based filter,[6] correlation-based feature selection,[4] Fisher score,[7] fast clustering-based feature selection (FAST),[8] and Relief or ReliefF.[9] Wrapper methods choose the optimal subset of features by evaluating the performance of the modeling algorithm, which is treated as a black box. Wrappers have a higher computational cost than filter methods. Furthermore, the selected subsets of features are biased toward the modeling algorithm used in the evaluation. For this reason, independent validation samples are necessary for a reliable estimation of the error. Embedded methods integrate the selection of features within the modeling algorithm, either as part of the predictive/descriptive method or as an extended functionality; thus, the selection of features is accomplished during the execution of the modeling algorithm. Hybrid methods combine the advantages of filter and wrapper methods. Usually these methods first apply a filter method to reduce the number of features, obtaining in many cases several possible subsets. A wrapper method is then used to obtain the best subset of features.

Previous literature on the construction of QSAR models shows that the most used feature selection methods are the following: chi-square (CS),[10,11] gain ratio (GR),[12,13] information gain (IG),[14,15] unbalanced correlation score (UCS),[7] mutual information (MI),[16,17] standard correlation score (Fisher score, FS),[18,19] F-score (FS)-based ranking,[20−22] Shannon entropy (SE),[23,24] recursive feature elimination (RFE),[25−28] and the fast clustering-based feature selection algorithm (FAST).[29]

Ensemble approaches, based on bagging and/or boosting, have been proposed for feature selection.[30] These methods have two configurable components: the boosting scheme and the base feature selection algorithm to be used. Ensembles of feature selectors are constructed by repeatedly applying feature selection algorithms and then combining their results. Ensembles of feature selectors focused on overcoming class imbalance problems have also been proposed.[31] In the construction of feature selection ensembles, the combination of the results of the different base selectors is crucial.[32] Methods for combining feature subset selectors usually just record, in a vector, the number of times that each feature was selected by the base selectors; this vector is then used to obtain the final selection. The most straightforward methods are the intersection or union of the feature sets. However, both have drawbacks: the intersection often returns very small sets and thus performs poorly, whereas the union generally performs better but achieves an almost negligible reduction. A more efficient solution is to use a vote threshold to obtain sets of features based on the classification performance[30] or subject to a data complexity measure.[33]

The main drawback of these three approaches (intersection, union, and vote thresholding) is that they disregard the relationships between the features that are selected together in each individual application of the feature selector. To solve this problem, in this work, we use an approach[34] based on a graph where the nodes represent the features and the links represent the co-occurrence of features in the same application of the algorithm, rather than storing the repeated application of different feature selectors as a vote vector. In this way, the method considers not only how many times a feature was selected but also which sets of features were selected together each time the algorithm was used.

The rest of this work is organized as follows: Section 2 describes the data set characteristics and molecular representation, the graph-based feature selection method, and the experimental setup. Section 3 describes the experimental results, and finally, Section 4 provides a summary of the conclusions of this work.

Material and Methods

In this section, we discuss the methodology used in this work, the data set, and the algorithms used.

Data Set Characteristics and Molecular Representation

In our study, the data were collected from different sources to yield a total of 24 data sets previously used for the construction of binary prediction models for different molecular targets. Each molecule in the data sets was represented using GSFrag,[35,36] which considers 1138 molecular fragments (247 GSFrag + 891 GSFragl), with the fragments consisting of one or more disconnected components. Each component considers, among other factors, paths of length n, cycles on m vertices, or paths (cycles) with a number of attached chains of unit length.

In the construction of QSAR models, the diversity of the data sets plays a fundamental role in the generalization of the models. Thus, models built from small or homogeneous compound sets offer poor generalization capacity.[37−39] Recently, González-Medina et al.[40] proposed a new approach to study the diversity of molecular databases from different perspectives, including fingerprint-based diversity and the diversity of physicochemical properties. In our work, we have studied the diversity of the data sets from four perspectives: (i) fingerprint-based diversity, (ii) diversity of physicochemical properties, (iii) minority class ratio diversity, and (iv) data set size diversity (number of compounds in the data set). To evaluate the fingerprint-based diversity (FpSim), we used the Tanimoto similarity index calculated from the topological fingerprint (Morgan/Circular Fingerprints, radius = 2)[41] for all pairs of molecules in each of the data sets. The diversity of physicochemical properties was calculated using two properties: the octanol/water partition coefficient (ALogPS) and molecular weight (MW). Fingerprints and physicochemical properties were calculated with the RDKit library.[42]

Table 1 summarizes the characteristics of the data sets. The information shown in the table includes a unique identifier for each data set, the total number of molecules, the number of active/inactive elements (positive/negative class), the coefficients of variation for ALogPS, MW, and FpSim, and a description of the molecular pathway end point. The coefficient of variation (CV) was defined as follows:

$$\mathrm{CV}(Me) = \frac{\sigma(Me)}{\mu(Me)}$$

with σ being the standard deviation, μ the mean, and Me the parameter under study (ALogPS, MW, FpSim).
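As an illustration of the fingerprint-based diversity measures described above, the following minimal sketch computes pairwise Tanimoto similarities and their coefficient of variation. It is not the authors' code: the fingerprints are toy sets of "on" bit indices rather than RDKit Morgan fingerprints, the function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
from itertools import combinations
from statistics import mean, pstdev


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coefficient_of_variation(values) -> float:
    """CV = standard deviation / mean (population std is an assumption)."""
    return pstdev(values) / mean(values)


def pairwise_similarities(fingerprints):
    """Tanimoto similarity for every unordered pair of molecules."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Toy fingerprints (hypothetical data, for illustration only).
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 7, 8}, {3, 4, 5, 6}]
sims = pairwise_similarities(fps)
avg_fpsim = mean(sims)                      # analogous to Avg (FpSim)
cv_fpsim = coefficient_of_variation(sims)   # analogous to CV (FpSim)
print(f"Avg(FpSim)={avg_fpsim:.3f}  CV(FpSim)={cv_fpsim:.3f}")
```

A low CV indicates that the average pairwise similarity is a representative summary of the data set, which is how the CV (FpSim) column of Table 1 is read later in this section.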

Table 1

Data Set Characteristics

Data set	No. molecules	Class –	Class +	CV (ALogPS)	CV (MW)	CV (FpSim)/Avg (FpSim)	Description	ref
DS1	432	155	277	0.430	0.187	0.295/0.408	Inhibitors of factor Xa of the benzamidine Family	(43)
DS2	311	242	69	0.351	0.205	0.299/0.528	Inhibitors of c-Jun N-terminal Kinase-3	(38)
DS3	534	337	197	2.058	0.486	0.227/0.363	Plasmodium falciparum growth inhibitor assay	(44)
DS4	780	409	371	0.659	0.405	0.269/0.390	Molecules set versus Mycobacterium tuberculosis	(45)
DS5	1510	820	690	0.295	0.233	0.234/0.424	Inhibitors of human β secretase 1	(46), (47)
DS6	1880	639	1241	0.494	0.283	0.235/0.366	P-glycoprotein inhibitors	(48)
DS7	483	241	242	0.740	0.474	0.259/0.383	P-glycoprotein substrates	(48)
DS8	567	260	307	0.658	0.390	0.232/0.382	Chembench: 313_MDR1	(49)
DS9	426	201	225	0.489	0.386	0.304/0.405	Chembench: 322-MRP1i10	(49)
DS10	122	61	61	1.805	0.363	0.288/0.402	Chembench: 342_MRP4x	(49)
DS11	70	35	35	0.774	0.273	0.312/0.397	Chembench: 412_NTCPx	(49)
DS12	82	41	41	2.691	0.405	0.216/0.384	Chembench: 422_OCT1x	(49)
DS13	292	139	153	1.701	0.392	0.261/0.431	Chembench: 24_PEPT1x	(49)
DS14	1219	610	609	0.679	0.350	0.212/0.361	Chembench: 151305_ebola_1224cpds_PCM4	(49)
DS15	171	64	107	0.273	0.163	0.345/0.498	Chembench: ack1	(49)
DS16	289	146	143	0.473	0.234	0.255/0.370	Chembench: BetaLactamase_Dataset_Vini	(49)
DS17	3823	1951	1872	0.256	0.181	0.214/0.374	Chembench: D2_improved_eugene	(49)
DS18	320	182	138	0.331	0.194	0.361/0.385	Chembench: IE_M1_Descriptors	(49)
DS19	369	278	91	0.970	0.628	0.375/0.310	Chembench: Ld50_impress_JRC	(49)
DS20	1919	864	1055	0.331	0.179	0.242/0.401	Chembench: IE_5-HT6_Descriptors	(49)
DS21	1290	154	1136	0.798	0.476	0.310/0.337	Estimation of aqueous solubility	(50), (51)
DS22	806	433	373	0.589	0.318	0.211/0.365	Human Ether-à-go-go-Related Gene	(51)
DS23	4054	2362	1692	0.327	0.213	0.199/0.356	Pubchem BioAssay: AID 2044	(51)
DS24	19737	19562	175	0.350	0.233	0.201/0.356	Malaria (Plasmodium falciparum)	(51)

In the Supporting Information, Figure S1a and b show the distributions of the physicochemical properties ALogPS and MW in each data set, while Figure S1c shows the cumulative distribution function of the pairwise similarity values of the compounds in each data set; both representations exhibit the diversity of the data sets used. The cumulative distribution function of the pairwise Tanimoto similarity computed with the Morgan fingerprint shows pairwise similarity values lower than 0.5 in 80% of the cases, highlighting the structural diversity of the molecules. Moreover, CV (FpSim) (Table 1) is equal to or less than 0.3 for a large majority of the data sets, indicating that the mean pairwise fingerprint similarity (Avg (FpSim)) is representative of the data set. Considering the 24 data sets, the values of Avg (FpSim) range from 0.31 to 0.53, which confirms the structural diversity mentioned above. In terms of molecular properties, the CV (ALogPS) and CV (MW) values show greater diversity for ALogPS than for MW. Moreover, the selected data sets cover a wide range of sizes, from a minimum of 70 compounds to a maximum of 19,737. The balance between the classes (active/inactive) also presents great diversity, with the percentage of the minority class ranging from 50% (perfectly balanced) to 0.8% (highly unbalanced). Figure 1 shows a more comprehensive multifactorial representation of data set diversity, which simultaneously represents the diversity of the chemical data sets by fingerprint (x axis), physicochemical properties (y axis), minority class ratio (color), and data set size (mark size).

Figure 1

Multifactorial data set diversity representation: fingerprint (x axis), physicochemical properties (y axis), minority class ratio (color), and data set size (mark size).

Graph-Based Feature Selection Method

The feature selection ensemble approach used in this paper is based on boosting.[30] In every feature selection boosting step, the selection is usually recorded by casting a vote for each selected feature. These votes are weighted when the corresponding boosting algorithm uses weighted classifiers as members of the ensemble. Once the Tr rounds are finished, a vector of votes is obtained that records how many times every feature has been selected, and a final selection is determined using that vector.[30] This approach does not take into account the relationships among the features.

The algorithm used in this work takes a different approach based on an undirected graph in the ensemble construction.[34] The first step of the algorithm consists of the construction of an undirected graph that stores the results of each feature selection process. The nodes of the graph store the features, and the edges represent the concurrent selection of two features in the same application of the algorithm. In the second step, this graph is used to select a group of features from the selections performed at every step of the ensemble construction process. Figure 2 shows the graph-based feature selection algorithm.

The first step stores the feature selection results using an undirected graph G(V, E), where the vertices V = {1, 2, ..., M} represent the features Φ = {ϕ_1, ϕ_2, ..., ϕ_M} and the links represent the simultaneous selection of two features. For simplicity, we consider that the vertex i coincides with the feature ϕ_i and is assigned the value v_i. Moreover, a member (i, j) of the set E corresponds to a link between i and j with value e(i, j).

Figure 2

Graph-based feature selection algorithm.
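The first step described above (storing repeated feature selections in an undirected graph) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_selection_graph`, the dictionary-based graph representation, and the toy selections are assumptions. The optional `weights` argument stands for the per-round vote weights used by weighted boosting schemes; with no weights, every selection counts as one vote.

```python
from collections import defaultdict
from itertools import combinations


def build_selection_graph(selections, weights=None):
    """Accumulate vertex and edge values from repeated feature selections.

    selections: list of sets of feature indices, one set per ensemble round.
    weights:    optional per-round vote weights; defaults to 1.0 per round.
    Returns (vertex, edge) where vertex[i] is the accumulated vote of feature i
    and edge[(i, j)] with i < j is the accumulated co-selection of i and j.
    """
    selections = list(selections)
    if weights is None:
        weights = [1.0] * len(selections)
    vertex = defaultdict(float)
    edge = defaultdict(float)
    for selected, alpha in zip(selections, weights):
        for i in selected:
            vertex[i] += alpha               # feature i was selected
        for i, j in combinations(sorted(selected), 2):
            edge[(i, j)] += alpha            # i and j were selected together
    return dict(vertex), dict(edge)


# Three hypothetical rounds of a base feature selector over features 0..3.
rounds = [{0, 1, 2}, {0, 1}, {1, 3}]
vertex, edge = build_selection_graph(rounds)
print(vertex)  # votes per feature
print(edge)    # co-selection counts per feature pair
```

Storing edges under an ordered key `(i, j)` with `i < j` keeps the graph undirected without duplicating entries.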

Once a feature selection method is applied in the ensemble construction (the FAST method[8] in this work), the values of the vertices corresponding to the selected features and of the edges linking these vertices are increased by one. If the boosting method is used for feature selection,[30,52] the votes used for the tth member of the ensemble can be weighted by a value α_t; in this case, the values are increased by α_t instead of by one. Finally, the vertex values in the created graph reflect how many times each feature was selected, and the link values reflect how many times two features were selected together. The method is based on the assumption that the features that represent the nonredundant information are those that are most frequently selected together and not those that rarely co-occur.

The second step of the algorithm uses the created graph to select a set of features. For this purpose, it is necessary to establish two thresholds: the first for the vertices (τϕ) and the second for the edges (τϵ). Using these thresholds, it is possible to select a set of features Φ′(τϕ, τϵ) by applying different strategies. To choose among the candidate subsets Φ′(τϕ, τϵ), each subset was evaluated in terms of reduction, r(Φ′(τϕ, τϵ)), and classification performance, measured by Cohen’s κ value, κ(Φ′(τϕ, τϵ)). Thus, considering m selected features from the full set of M features, the reduction was measured as $r = 1 - m/M$, and both metrics were combined into a single score, J(Φ′(τϕ, τϵ)). The threshold pair with the highest J is chosen by evaluating all possible vertex and edge threshold pairs. As candidate thresholds, either all of the distinct values for vertices and edges or a fixed number of values can be considered; in the latter case, the range of values can be divided into equal subintervals.

Once the values for a pair (τϕ, τϵ) have been set, the feature selection process proceeds in the following way: initially, every feature ϕ_i whose vertex value v_i is greater than or equal to τϕ is selected. Then, for every pair of selected features (ϕ_i, ϕ_j), if the corresponding edge value is below the edge threshold, e(i, j) < τϵ, feature ϕ_i is removed if v_i < v_j; otherwise, feature ϕ_j is removed. With a vertex threshold of 0, the method includes the particular case of a standard voting scheme in which all voting combinations are considered. Using this first strategy, two variants of the algorithm are considered: (i) the mT variant, where τϕ = 0 and τϵ ≠ 0, and (ii) the MmT variant, where τϕ ≠ 0 and τϵ ≠ 0. These two methods are shown on the left-hand side of the algorithm in Figure 2.

The second strategy of the algorithm (right-hand side in step 2 of the algorithm, Figure 2) is based on forming a chain of features using the following procedure. First, the feature corresponding to the vertex with the largest value is selected. Then, among the edges of this vertex, the one with the largest value above the given threshold τϵ is chosen, and the feature at its other end is added to the set of selected features. This feature becomes the new starting point, and the process ends when all of the edges above the threshold from the current last member of the chain are linked to already-selected vertices. In this second strategy, two variants were also considered: mTC, for τϕ = 0 and τϵ ≠ 0, and MmTC, for τϕ ≠ 0 and τϵ ≠ 0.
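The two selection strategies described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the dictionary-based graph (`vertex[i]` for vertex values, `edge[(i, j)]` with `i < j` for edge values), and the toy graph are hypothetical. The text does not specify the order in which pairs are visited, whether an already-removed feature still takes part in later comparisons, how ties are broken, or how the vertex threshold applies to the chain strategy; the sketch visits pairs in index order, skips removed features, breaks ties by the smallest index, restricts the chain to vertices with values at or above the vertex threshold, and treats "above the threshold" as greater than or equal to it.

```python
def edge_value(edge, i, j):
    """Value of the undirected edge between i and j (0.0 if never co-selected)."""
    return edge.get((min(i, j), max(i, j)), 0.0)


def select_by_thresholds(vertex, edge, tau_v, tau_e):
    """First strategy: mT when tau_v == 0, MmT otherwise."""
    selected = sorted(i for i, v in vertex.items() if v >= tau_v)
    removed = set()
    for pos, i in enumerate(selected):
        for j in selected[pos + 1:]:
            if i in removed or j in removed:
                continue  # assumption: removed features are not compared again
            if edge_value(edge, i, j) < tau_e:
                # Drop the feature with the smaller vertex value.
                removed.add(i if vertex[i] < vertex[j] else j)
    return [f for f in selected if f not in removed]


def select_chain(vertex, edge, tau_v, tau_e):
    """Second strategy: mTC when tau_v == 0, MmTC otherwise."""
    candidates = {i for i, v in vertex.items() if v >= tau_v}
    if not candidates:
        return []
    # Start from the vertex with the largest value (smallest index on ties).
    current = max(candidates, key=lambda i: (vertex[i], -i))
    chain = [current]
    while True:
        options = []
        for j in candidates:
            w = edge_value(edge, current, j)
            if j not in chain and w > 0 and w >= tau_e:
                options.append((w, -j))
        if not options:
            break  # every qualifying edge leads to an already-selected vertex
        _, neg_j = max(options)
        current = -neg_j
        chain.append(current)
    return chain


# Toy graph (hypothetical values).
vertex = {0: 2.0, 1: 3.0, 2: 1.0, 3: 1.0}
edge = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0, (1, 3): 1.0}
print(select_by_thresholds(vertex, edge, tau_v=0, tau_e=2))
print(select_chain(vertex, edge, tau_v=0, tau_e=1))
```

In a full pipeline, each candidate pair of thresholds would be scored by combining reduction and Cohen's κ as described in the text, and the best-scoring pair would be retained; that search is omitted here.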

Experimental Setup

The experiments were performed following the protocol shown in Figure S2 of the Supporting Information. Each data set was divided randomly into two disjoint sets: one to build the model and the other to perform the external validation. Thus, following a repeated double cross-validation procedure,[53,54] five external validation rounds (external loop in Figure S2) were completed. The feature selection process was conducted using the subset of molecules that were used to build the model.

The graph-based method was evaluated with three different well-known classifiers: a decision tree (DT), a support vector machine (SVM), and a random forest (RF). To set the hyperparameters of the classifiers in the model’s construction (inner loop in Figure S2), we used a 10-fold internal cross-validation process. For each classifier, we identified the best hyperparameter values from a set of possible values. For RF, we used 100 trees and the Gini impurity criterion to measure the quality of a split. The nodes were expanded until all leaves were pure or until all leaves contained fewer than two samples, and bootstrap samples were used for building the trees. For SVMs, three parameters were set: the kernel type, the C value, and, for the Gaussian kernel, the γ value. Thus, we tested a linear kernel with C ∈ {0.1, 1, 10} and a Gaussian kernel with C ∈ {0.1, 1, 10} and γ ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10}. All 21 possible combinations (3 linear and 18 Gaussian) were evaluated. For decision trees, we used 1 and 10 trials, with and without the option of softening the thresholds, and tested all four possible combinations.

The performance achieved by a classifier was measured using the geometric mean (G-Mean) of sensitivity and specificity,[55] defined as

$$\text{G-Mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. The use of this metric is recommended for both balanced and unbalanced data sets because it takes into account the uneven distribution of class instances.
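The G-Mean described above can be computed directly from the confusion matrix counts. The following is a minimal sketch; the function name `g_mean` and the convention of returning 0.0 when a class has no instances are assumptions made for illustration.

```python
from math import sqrt


def g_mean(tp: int, tn: int, fp: int, fn: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    # Assumption: a class with no instances contributes a rate of 0.0.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


# A classifier that labels every molecule as inactive on an unbalanced set:
# accuracy is high (100 / 110), but G-Mean exposes the missed actives.
print(g_mean(tp=0, tn=100, fp=0, fn=10))
# A more balanced behavior on the same kind of data.
print(g_mean(tp=8, tn=90, fp=10, fn=2))
```

Because the two rates are multiplied, a classifier cannot obtain a high G-Mean by favoring the majority class alone, which is why the metric suits the unbalanced data sets in Table 1.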
The reduction capacity of a feature selection method can be defined as follows:

$$R = 1 - \frac{m}{M}$$

where m is the number of selected features, and M denotes the total number of features.

Redundancy was evaluated using two different metrics: one based on mutual information and the second based on correlation.[56] Mutual information can be defined as follows:

$$\mathrm{MI}(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$$

where X and Y denote the feature vector and the class vector, respectively, and P(·) represents probability. Consider S to be a vector of a given set of features and h a class variable. The redundancy based on mutual information (MI) was computed over the features in S, with |S| denoting the number of features in S. Redundancy based on correlation (AcRed) was evaluated by replacing mutual information with a correlation coefficient.[56] For this purpose, the mean (meanC) of the correlation coefficient was calculated. Thus, AcRed was defined in terms of Cor(f_i, f_j), with f_i, f_j ∈ S ∀ i, j = 1, 2, ..., m, where Cor() is a correlation coefficient function and abs() an absolute value function.

To guarantee a rigorous comparison between the graph-based feature selection method and the standard method, it is necessary to use statistical tests specially designed to evaluate multiple algorithms on multiple data sets. To do this, it is first necessary to know whether there is a significant difference among the methods. The Iman–Davenport test is recommended to determine the existence of these statistical differences. It is based on the χ² Friedman test, which compares the average ranks (R_j) of k algorithms over N data sets, but it is more powerful.[57] After applying the Iman–Davenport test, it is necessary to apply one of the general procedures for controlling the family-wise error in multiple testing. The Holm test[57] is designed to compare, in a stepwise manner, the algorithm with the best performance in terms of Friedman ranks with the rest of the methods under study.
The test statistic for comparing the ith and jth methods is defined as follows:

$$z = \frac{R_i - R_j}{\sqrt{\dfrac{k(k+1)}{6N}}}$$

Using the normal distribution table, the values of z were used to find the corresponding probability. Step-down procedures sequentially test the hypotheses in order of their significance; the ordered p values were denoted by p_1, p_2, ..., such that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}. The Holm method compares each p_i with α/(k − i). The step-down procedure starts with the most significant p value. If p_1 is below α/(k − 1), the corresponding hypothesis is rejected, and we compare p_2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds to the third and so on. When a null hypothesis cannot be rejected, all remaining hypotheses are retained as well. For all tests, we used a significance level α = 0.05.

To compare multiple algorithms, we used the Nemenyi[58] test. This test considers the performances of two algorithms to be significantly different if the corresponding average Friedman ranks differ by at least the critical difference. The critical difference (CD) for N data sets and k algorithms was formulated as follows:[57,58]

$$\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}$$

where q_α is the critical value based on the Studentized range statistic divided by √2.
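The statistical procedure described above can be sketched with the standard library alone. This is an illustrative sketch, not the code used in the study: the function names and toy scores are hypothetical, the Friedman and Iman–Davenport statistics follow the usual textbook formulas for k algorithms over N data sets, the p values are assumed to be two-sided, and the Nemenyi critical value `q_alpha` must be supplied by the caller (2.728 is the value commonly tabulated for k = 5 at α = 0.05 and should be checked against a statistical table).

```python
from math import erfc, sqrt


def average_ranks(scores):
    """scores[d][a]: score of algorithm a on data set d (higher is better)."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            rank = (pos + end) / 2 + 1  # tied algorithms share the mean rank
            for a in order[pos:end + 1]:
                totals[a] += rank
            pos = end + 1
    return [t / len(scores) for t in totals]


def iman_davenport(ranks, n):
    """Friedman chi-square and Iman-Davenport F statistic."""
    k = len(ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    # Undefined (division by zero) if all data sets rank the algorithms identically.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


def holm(ranks, n, alpha=0.05):
    """Compare the best-ranked algorithm with the others (step-down)."""
    k = len(ranks)
    best = min(range(k), key=lambda a: ranks[a])
    se = sqrt(k * (k + 1) / (6 * n))
    # Two-sided p value from the normal distribution: erfc(|z| / sqrt(2)).
    pvals = sorted((erfc(abs(ranks[a] - ranks[best]) / se / sqrt(2)), a)
                   for a in range(k) if a != best)
    rejected, rejecting = {}, True
    for i, (p, a) in enumerate(pvals, start=1):
        rejecting = rejecting and p < alpha / (k - i)
        rejected[a] = rejecting  # once one test fails, the rest are retained
    return best, rejected


def nemenyi_cd(k, n, q_alpha):
    """Critical difference between average ranks."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Toy G-Mean scores: 4 data sets (rows) x 3 methods (columns); hypothetical values.
scores = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.90, 0.70, 0.80], [0.95, 0.90, 0.50]]
ranks = average_ranks(scores)
chi2, f_stat = iman_davenport(ranks, n=len(scores))
best, rejected = holm(ranks, n=len(scores))
print(ranks, chi2, f_stat, best, rejected)
print(nemenyi_cd(k=5, n=24, q_alpha=2.728))
```

Converting the Iman–Davenport statistic into a p value requires the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, which is not available in the standard library and is therefore left out of the sketch.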

Experimental Results

To assess the effectiveness of the proposed feature selection approach in the construction of binary QSAR models for molecular activity prediction, we performed different experiments with the three classifiers mentioned above on 24 data sets with different molecular activities. As stated, the algorithm supports the use of different feature selection methods. We used fast clustering-based feature selection (FAST) in the tests.[8] The standard voting AdaBoost method (T) was used as the reference for the comparisons. The experimental results are included in Tables S1–S3 of the Supporting Information. In this section, we present a graphical representation (Figures 3–5) of the results for ease of presentation and discussion. These figures include two types of representation: the first (panel (a) in Figures 3–5) represents the average values of four metrics, G-Mean (y axis), AcRed (x axis), MI (color), and R (mark size), while the second (panels (b)–(e) in Figures 3–5) represents the performance in terms of two metrics (G-Mean, AcRed) for each data set.[59] To construct this 2D representation, the value on each axis represents the difference between a graph-based method (MmT, MmTC, mT, mTC) and the base method (T) for the same data set. In this way, the arrows pointing downward-left represent the data sets for which the base T algorithm outperformed our graph-based method in terms of G-Mean but was worse in terms of AcRed; the arrows pointing upward-left indicate that our graph-based method improved both the G-Mean and AcRed. The arrows pointing upward-right show the data sets for which our graph-based method improved the G-Mean but had an inferior AcRed, and the arrows pointing downward-right show the data sets for which the base T algorithm outperformed our graph-based method with respect to both G-Mean and AcRed. In the figures, the values of the differences are represented as percentages.
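The reading of the arrow directions described above can be made explicit with a small helper. This is only an illustrative sketch: the function name is hypothetical, the differences are taken as plain subtractions (the text reports them as percentages without stating the exact normalization), and a zero difference is arbitrarily assigned to the "downward" or "right" side.

```python
def arrow_direction(g_mean_graph, g_mean_base, acred_graph, acred_base):
    """Quadrant of the arrow for one data set in panels (b)-(e)."""
    dy = g_mean_graph - g_mean_base  # > 0: the graph-based method has a higher G-Mean
    dx = acred_graph - acred_base    # < 0: the graph-based method is less redundant
    vertical = "upward" if dy > 0 else "downward"
    horizontal = "left" if dx < 0 else "right"
    return f"{vertical}-{horizontal}"


# Hypothetical values: higher G-Mean and lower AcRed than the base method T.
print(arrow_direction(0.80, 0.70, 0.20, 0.30))
```

Under this convention, "upward-left" is the most favorable quadrant for the graph-based methods, because a higher G-Mean and a lower AcRed are both improvements.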

Figure 3

Four metrics results (a) and the moment diagrams (b–e) representing the differences (G-Mean) of the proposals against the base method (T) for the DT classifier.

Figure 5

Four metrics results (a) and the moment diagrams (b–e) representing the differences (G-Mean) of the proposals against the base method (T) for the SVM classifier.

Figure 3 shows the results for the DT classifier. The results in terms of G-Mean (Figure 3a) showed a better performance for the graph-based methods compared to the base method T. This better behavior was also reflected in the number of data sets for which the graph-based methods outperformed T: MmT (Figure 3b) improved G-Mean in 17 of the 24 data sets evaluated, MmTC (Figure 3c) in 19, mT (Figure 3d) in 20, and mTC (Figure 3e) in 21. Regarding redundancy, the mean values obtained for MI and AcRed also showed better results (lower values) for the graph-based methods compared to the base method (x axis and color scale values in Figure 3a). According to the distribution of the differences for each data set in terms of AcRed, the best results were obtained with the MmTC, mT, and mTC methods, for which 16 of the 24 data sets obtained better results compared to the base method T. Moreover, for reduction (R), the results were very similar for all of the methods.

Figure 4 shows the results for the RF classifier. The overall performance for RF in terms of G-Mean was better than that obtained using DTs. Higher average values were obtained, and the advantage of the graph-based methods with respect to the base method T was retained. In terms of redundancy (MI, AcRed), values very similar to those achieved by DT were obtained, with better performance for the graph-based methods than for T. For RF, the distribution by data set showed the best results for the mT and mTC methods, which surpassed the T method in 20 data sets in terms of G-Mean and in 18 data sets in terms of AcRed.

Figure 4

Four metrics results (a) and the moment diagrams (b–e) representing the differences (G-Mean) of the proposals against the base method (T) for the RF classifier.

For the SVM classifier (Figure 5), although the values of G-Mean were lower than those for RF, they showed a global behavior similar to that obtained with both DT and RF: the G-Mean was better with MmT, MmTC, mT, and mTC than with T. The values of R, MI, and AcRed also behaved similarly to the results presented for the DT and RF classifiers, with a decrease in redundancy (MI, AcRed) when the MmT, MmTC, mT, and mTC methods were applied.

Although the results presented so far show the benefits of using the graph-based method compared to the standard method, the benefits must be validated by means of statistical tests to obtain conclusive results. First, we tested for global significant differences using the Iman–Davenport test. If this test rejected the null hypothesis, we used the Holm procedure to compare the best method with the rest of the methods and then the Nemenyi test to perform a global comparison of all the methods. Tables 2 and 3 show the results of these statistical tests, and Figures 6 and 7 show a graphical representation of these results to facilitate comparisons.[57] The Holm graphs show the best of the algorithms on the y axis and use a bar graph to represent the p values and a line graph to represent the thresholds. The Nemenyi graphs connect with a horizontal line the groups of algorithms that were not significantly different and show the critical difference in the upper left corner of the graph.

Table 2

Performance (G-Mean) Statistical Test Resultsᵃ

ᵃ The best method according to the Holm test is indicated by the thumbs up hand symbol.

Table 3

Reduction and Redundancy Statistical Test Results for RFᵃ

ᵃ The best method according to the Holm test is indicated by the thumbs up hand symbol.

Figure 6

Performance statistical test results representation in terms of G-Mean.

Figure 7

Reduction and redundancy statistical test results representation.

As shown in Table 2 and Figure 6 for the performance of the classifiers in terms of G-Mean, the p value obtained for the Iman–Davenport test was very close to zero, demonstrating the existence of a significant difference between the evaluated methods. Moreover, as shown in Figure 6, for all classifiers, the methods selected as the best by the Holm test showed significant differences with respect to the base method (T) and no significant differences with respect to the rest of the compared methods. The methods selected as the best were the following: mTC for DT, mT for RF, and MmTC for SVM. The Nemenyi test confirmed these results, with the base method (T) performing worst for all classifiers. However, the behavior differed between classifiers: for DT, the difference between MmT and T was not significant, whereas mTC and MmTC differed significantly from T; for RF, all of the methods differed significantly from T; and for SVM, no significant differences were found among the compared methods.

As shown in Table 3 and Figure 7, in terms of reduction (R), the results of the Iman–Davenport test showed significant differences. For redundancy, a significant difference was obtained for AcRed but not for MI. The best method in terms of Friedman’s ranks for R and MI was the mTC method, and the best in terms of AcRed was mT. In all cases, the base method T produced the worst results, with a significant difference (below the threshold) in terms of R and AcRed. The results of the Nemenyi test confirmed the worst results obtained by T in terms of R, MI, and AcRed. Moreover, in terms of R, the results did not show a significant difference between the T and MmT methods, with the best performances observed for the mTC and MmTC methods, which differed significantly from T. In terms of AcRed, the best performance was obtained for the mT method, with significant differences with respect to T, and in terms of MI, the results did not show a significant difference.

Finally, the experiments were extended to the prediction of toxicity. For this purpose, the benchmark proposed in the Tox21 project[60−62] was used. Figures 8 and 9 show the results of the Nemenyi test in terms of performance and of reduction and redundancy, respectively. The experimental results are included in Tables S5–S9 of the Supporting Information.

Figure 8

Performance Nemenyi test results for TOX-21 benchmark in terms of G-Mean.

Figure 9

Reduction and redundancy Nemenyi test results for TOX-21 benchmark.

In terms of G-Mean (Figure 8), the best results for the DT and RF classifiers were observed for the MmTC, MmT, and mT methods, which showed significant differences with respect to the base method T. For the SVM classifier, the best results were obtained for the MmTC and MmT methods, with significant differences with respect to the base method T. As Figure 9 shows, in terms of reduction (R), the proposals outperform the base method. In terms of redundancy, the proposals outperform the base method for both MI and AcRed. The best results were obtained for the MmT and mT methods.
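The Nemenyi comparisons reported above rest on a critical difference between average ranks, which can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. The critical values are the commonly tabulated ones for a significance level of 0.05, and the average ranks and the number of data sets in the example are hypothetical placeholders, not results from the paper.

```python
import math

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05 (Studentized
# range statistic divided by sqrt(2)), indexed by the number of methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_critical_difference(n_methods, n_datasets):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return Q_ALPHA_005[n_methods] * math.sqrt(
        n_methods * (n_methods + 1) / (6.0 * n_datasets))


def significantly_different_pairs(avg_ranks, n_datasets):
    """Pairs of methods whose average ranks differ by more than the CD."""
    cd = nemenyi_critical_difference(len(avg_ranks), n_datasets)
    names = list(avg_ranks)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(avg_ranks[a] - avg_ranks[b]) > cd]


# Hypothetical average ranks for the five compared methods (lower is better)
# over an assumed 12 data sets; these numbers are NOT taken from the paper.
example_ranks = {"T": 4.5, "mT": 2.6, "MmT": 3.2, "mTC": 2.2, "MmTC": 2.5}
print(nemenyi_critical_difference(5, 12))
print(significantly_different_pairs(example_ranks, 12))
```

Two methods are declared significantly different only when the gap between their average ranks exceeds the critical difference, which is why the Nemenyi test can retain differences that a Holm comparison against a single control method rejects.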

Conclusions

In this work, we evaluated the application of a new feature selection approach to the prediction of molecular activity. The approach is based on the construction of an undirected graph to combine base feature selectors applied to different feature sets. In contrast to the standard voting approach, the method considers not only the frequency with which a feature is selected but also its relationship with other features. Compared with the standard voting method (T), the experimental results for different scenarios, which include the DT, RF, and SVM classifiers, showed the advantages of the graph-based methods mTC, MmTC, mT, and MmT in terms of classifier performance. Among the graph-based methods, mTC showed better overall performance. One of the main advantages of the graph-based method is that any standard feature selection algorithm can be applied, thus opening new lines of research. Furthermore, the same idea could be adapted to the instance selection problem or to the joint selection of features and instances for the construction of QSAR models.
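The contrast between counting votes and recording relationships between features can be illustrated with a small sketch. It is not the authors' algorithm: the actual graph construction and the mT, MmT, mTC, and MmTC variants are defined in the methods section of the paper and are not reproduced here. The function names, the vote threshold, and the choice of co-selection counts as edge weights are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations


def vote_counts(selections):
    """Number of base selectors that chose each feature."""
    return Counter(f for subset in selections for f in set(subset))


def voting_selection(selections, min_votes):
    """Plain voting: keep features chosen by at least min_votes selectors."""
    return {f for f, c in vote_counts(selections).items() if c >= min_votes}


def co_selection_graph(selections):
    """Undirected weighted graph stored as {(a, b): weight} with a < b.

    ASSUMPTION: the weight of an edge is the number of base selectors that
    chose both features; the paper may define its edges differently.
    """
    edges = Counter()
    for subset in selections:
        for a, b in combinations(sorted(set(subset)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical outputs of four base feature selectors.
base_selections = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f4"}, {"f3", "f4"}]
print(voting_selection(base_selections, min_votes=3))
print(co_selection_graph(base_selections))
```

Voting sees f2, f3, and f4 as equivalent because each receives two votes, whereas the graph additionally records that f1 and f2 were chosen together twice. This pairwise information is the kind of relationship a graph-based combination can exploit.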

Data and Software Availability

All the data on which the conclusions of the work are based have been exhaustively presented in the manuscript. Data sets DS1–DS24 used in the paper are included as Supporting Information. The toxicity data sets can be downloaded from the following Tox21 project link: https://tripod.nih.gov/tox21/challenge/. The source code, released under the GNU General Public License v3.0, can be downloaded from the following link: http://cib.uco.es/wp-content/uploads/2021/11/source.zip

