
Examining the predictive accuracy of the novel 3D N-linear algebraic molecular codifications on benchmark datasets.

César R García-Jacas¹, Ernesto Contreras-Torres², Yovani Marrero-Ponce³, Mario Pupo-Meriño⁴, Stephen J Barigye⁵, Lisset Cabrera-Leyva⁶.

Abstract

BACKGROUND: Recently, novel 3D alignment-free molecular descriptors (also known as QuBiLS-MIDAS) based on two-linear, three-linear and four-linear algebraic forms have been introduced. These descriptors codify chemical information for relations between two, three and four atoms by using several (dis-)similarity metrics and multi-metrics. Several studies aimed at assessing the quality of these novel descriptors have been performed. However, a deeper analysis of their performance is necessary. Therefore, in the present manuscript an assessment and statistical validation of the performance of these novel descriptors in QSAR studies is performed.
RESULTS: To this end, eight molecular datasets (angiotensin converting enzyme, acetylcholinesterase inhibitors, benzodiazepine receptor, cyclooxygenase-2 inhibitors, dihydrofolate reductase inhibitors, glycogen phosphorylase b, thermolysin inhibitors, thrombin inhibitors) widely used as benchmarks in the evaluation of several procedures are utilized. Three- to nine-variable QSAR models based on Multiple Linear Regression are built for each chemical dataset according to the original division into training/test sets. Comparisons with respect to leave-one-out cross-validation correlation coefficients ($Q^{2}_{loo}$) reveal that the models based on QuBiLS-MIDAS indices possess superior predictive ability in 7 of the 8 datasets analyzed, outperforming methodologies based on similar or more complex techniques such as Partial Least Squares, Neural Networks, Support Vector Machines and others. On the other hand, superior external correlation coefficients ($Q^{2}_{ext}$) are attained in 6 of the 8 test sets considered, confirming the good predictive power of the obtained models. For the $Q^{2}_{ext}$ values, non-parametric statistical tests were performed, which demonstrated that the models based on QuBiLS-MIDAS indices have the best global performance and yield significantly better predictions than 11 of the 12 QSAR procedures used in the comparison. Lastly, a study concerning the performance of the indices according to several conformer generation methods was performed. This demonstrated that the quality of the predictions of the QSAR models based on QuBiLS-MIDAS indices depends on the 3D structure generation method considered, although in this preliminary study the results achieved do not present statistically significant differences among them.
CONCLUSIONS: The QuBiLS-MIDAS indices are suitable for extracting structural information from molecules and thus constitute a promising alternative for building models that contribute to the prediction of pharmacokinetic, pharmacodynamic and toxicological properties of novel compounds.
Graphical abstract: Comparative graphical representation of the performance of the novel QuBiLS-MIDAS 3D-MDs with respect to other methodologies in QSAR modeling of eight chemical datasets.


Keywords: 3D-QSAR; Multiple Linear Regression; QuBiLS-MIDAS; TOMOCOMD-CARDD

Year: 2016 PMID: 26925168 PMCID: PMC4768433 DOI: 10.1186/s13321-016-0122-x

Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514

Background

Computational methods that employ statistical and/or artificial intelligence procedures are widely used in the drug discovery process, where Quantitative Structure–Activity Relationship (QSAR) studies play an important role [1-4]. These studies are based on the principle that the biological activity (or property) of compounds depends on their structural and physicochemical features, and thus they are primarily aimed at finding good correlations between molecular features and specific biological activities [5]. In this way, models with high external predictive ability for novel compounds can be built. Since the works of Hansch and Fujita in the 1960s [6, 7], considered the origins of modern QSAR studies [8], several approaches have been reported in the literature, most of them being 2D-QSAR methods; that is, they only consider the topological structural features of molecules, often using matrix representations such as the connectivity and distance matrices [8]. However, with the introduction of the CoMFA [9] methodology in 1988, 3D-QSAR approaches became popular. These take into account the geometric (3D) features of molecules, which can be computed either from the information represented in a grid through an alignment process with respect to a reference compound or a pharmacophore [2, 10, 11], or using procedures based on Cartesian coordinates [8, 12, 13], molecular spectra [14, 15] and molecular transforms [16], or by the adaptation of 2D methods to take into account three-dimensional (3D) aspects [17-21]. However, despite the number and variety of procedures defined to date, there is continued interest in creating new approaches or extending the current ones to more generalized forms in order to codify more relevant chemical information, with the aim of yielding QSAR models with better predictive ability.
This assertion is in accordance with the No Free Lunch Theorem [22], which can be interpreted as follows: no single QSAR procedure yields predictions superior to all the others when its performance is averaged over all possible compound datasets. This is confirmed in a report by Sutherland et al. [23], in which well-established procedures, assessed on eight diverse chemical datasets, yield moderate predictions without significant differences among them (see Additional file 1: Table S1 for a statistical analysis). The justification for this observation is that one family of molecular descriptors (MDs) may not suffice to codify all chemical information and/or molecular properties for different chemical datasets. In other words, the relevance of MDs depends on the nature of the compounds under study. It is therefore necessary to search for alternative methods/approaches to codify novel and orthogonal chemical information. Inspired by this idea, the 3D N-linear algebraic molecular descriptors have recently been introduced as a novel mathematical procedure for computing the structural features of chemical compounds [24-26]. These MDs employ the bilinear, quadratic and linear algebraic maps [27] to codify information between atom-pairs by using several (dis-)similarity metrics [25]. Also, the N-linear algebraic forms [28] were used as generalized expressions of the bilinear, quadratic and linear algebraic maps when relations among three and four atoms are studied [26]. In this way, the geometric matrix [8] was extended to consider, for the first time, relations among more than two atoms.
Several studies aimed at assessing the quality of this novel descriptor family, also called QuBiLS-MIDAS [acronym of Quadratic, Bilinear and N-Linear Maps based on N-tuple Spatial Metric [(Dis)-Similarity] Matrices and Atomic Weightings], have been performed. These included an evaluation of the information content (variability) and of the linear independence, using Shannon’s entropy based variability analysis [29] (with the IMMAN software [30]) and the principal component analysis (PCA) technique [31], respectively. Also, comparisons with other MDs reported in the literature were performed [25, 26]. In general, the results demonstrated that the novel MDs have greater variability than the 3D DRAGON indices and other approaches implemented in several software packages [32-35]. Furthermore, the results revealed that the novel 3D N-linear indices not only codify all the information contained in the 3D DRAGON MDs, but also capture information orthogonal to the latter. Lastly, the QuBiLS-MIDAS MDs were used for modeling the binding affinity to the corticosteroid-binding globulin (CBG), achieving superior results with respect to other QSAR methodologies (see Tables 8–9 in Ref. [25] and Tables 9–10 in Ref. [26]). However, although the initial results with the QuBiLS-MIDAS MDs are promising, it cannot be stated that these are the most suitable for building QSAR models for all chemical datasets. It is thus necessary to evaluate the performance of the 3D N-linear algebraic MDs in QSAR modeling with different molecular sets. Therefore, this paper is dedicated to assessing the utility of the QuBiLS-MIDAS approach in the prediction of biological activity in several compound datasets and to comparing the obtained results with those of other QSAR procedures reported in the literature.

Mathematical overview of the 3D N-linear algebraic molecular descriptors

In this report, the total and local-fragment 3D N-linear algebraic indices [25, 26] (also known as QuBiLS-MIDAS) are employed to assess the predictive accuracy of this approach in QSAR studies. These molecular descriptors (MDs) are calculated from the contribution of each atom in a molecule. That is, if a molecule comprises n atoms, then the kth two-linear, three-linear and four-linear algebraic indices for each atom “a” are computed as N-linear (multi-linear) algebraic forms (maps) in $\mathbb{R}^{n}$, in a canonical basis set, when relations among two (N = 2), three (N = 3) and four (N = 4) atoms are considered, respectively. These descriptors are mathematically expressed as follows:

$$L_{a} = \sum_{i=1}^{n}\sum_{j=1}^{n} g_{ij}\, x_{i}\, y_{j} \qquad (1)$$

$$L_{a} = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n} gt_{ijl}\, x_{i}\, y_{j}\, z_{l} \qquad (2)$$

$$L_{a} = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{l=1}^{n}\sum_{m=1}^{n} gq_{ijlm}\, x_{i}\, y_{j}\, z_{l}\, w_{m} \qquad (3)$$

where “a” is a specific atom (a = 1, 2,…,n), n is the number of atoms in a molecule, $L_{a}$ is the entry corresponding to the contribution of atom “a” in the vector of atom-level indices $\bar{L}$, F is a local-fragment (group or atom-type) that may or may not be considered in the index computation, and $x_{1},…,x_{n}$, $y_{1},…,y_{n}$, $z_{1},…,z_{n}$ and $w_{1},…,w_{n}$ are the values (coordinates or components) of the molecular vectors $\bar{x}$, $\bar{y}$, $\bar{z}$ and $\bar{w}$, respectively. In addition, the coefficients $g_{ij}$, $gt_{ijl}$ and $gq_{ijlm}$ are the elements of the kth two-tuple, three-tuple and four-tuple atom-level spatial-(dis)similarity matrices [$\mathbb{G}^{a,k}_{(F)}$, $\mathbb{GT}^{a,k}_{(F)}$ and $\mathbb{GQ}^{a,k}_{(F)}$], which are obtained from the corresponding kth two-tuple, three-tuple and four-tuple total (or local-fragment) spatial-(dis)similarity matrices [$\mathbb{G}^{k}_{(F)}$, $\mathbb{GT}^{k}_{(F)}$ and $\mathbb{GQ}^{k}_{(F)}$]. Lastly, k (±1, ±2,…,±12) is the power to which the matrix approaches are raised through the Hadamard product. The molecular vectors (or property vectors) $\bar{x}$, $\bar{y}$, $\bar{z}$ and $\bar{w}$ are calculated by using the Chemistry Development Kit (CDK) library [36], considering the following fragment- and atom-based properties: atomic mass (m), van der Waals volume (v), atomic polarizability (p), atomic electronegativity in the Pauling scale (e), atomic Ghose-Crippen LogP (a), Gasteiger-Marsili atomic charge (c), atomic polar surface area (psa), atomic refractivity (r), atomic hardness (h) and atomic softness (s).
The total matrix approaches $\mathbb{G}^{k}$, $\mathbb{GT}^{k}$ and $\mathbb{GQ}^{k}$ constitute the basis for the calculation of the two-linear, three-linear and four-linear indices, and they are employed to represent the chemical information codified in the interactions among “N” atoms of a molecule. Specifically, for k = 1 (matrix of order 1), the coefficients $g_{ij}$, $gt_{ijl}$ and $gq_{ijlm}$ corresponding to the matrices $\mathbb{G}^{1}$, $\mathbb{GT}^{1}$ and $\mathbb{GQ}^{1}$ can be calculated by using several (dis)-similarity metrics and multi-metrics to capture the information on the relations between two, three and four atoms, respectively [25, 26]. To compute the atom-pair relations, metrics (see Table 1) derived from the general Minkowski definition (e.g. Manhattan, Euclidean) are employed, as well as others that have been successfully used in machine learning algorithms and similarity/dissimilarity studies (e.g. Canberra, Soergel, Clark). On the other hand, different multi-metrics (see Table 2) can be utilized to calculate the ternary (three) and quaternary (four) relations among atoms of a molecule, such as the bond angle for relations among three atoms and the dihedral angle for relations among four atoms. Table 3 shows examples of two-tuple and three-tuple total spatial-(dis)similarity matrices calculated with some of the previously mentioned metrics and multi-metrics.
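As an illustration of the N-linear forms described in this section, the following minimal Python sketch evaluates a two-, three- and four-linear form for a given coefficient array and given property vectors. The function names are hypothetical, the Euclidean matrix of Table 3 is used in place of an atom-level matrix purely for demonstration, and the property values are standard atomic masses and Pauling electronegativities entered by hand rather than CDK output.

```python
import numpy as np


def two_linear(G, x, y):
    """Two-linear (bilinear) form: sum_ij G[i, j] * x[i] * y[j]."""
    return float(np.einsum("ij,i,j->", G, x, y))


def three_linear(GT, x, y, z):
    """Three-linear form over an n x n x n coefficient array."""
    return float(np.einsum("ijl,i,j,l->", GT, x, y, z))


def four_linear(GQ, x, y, z, w):
    """Four-linear form over an n x n x n x n coefficient array."""
    return float(np.einsum("ijlm,i,j,l,m->", GQ, x, y, z, w))


# Euclidean two-tuple matrix of chloro(methoxy)methane (atoms C1, C2, O3, Cl4),
# copied from Table 3 and used here in place of an atom-level matrix.
G = np.array([
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
])

# Illustrative property vectors (assumed values, not produced by the CDK):
# standard atomic masses and Pauling electronegativities of C, C, O, Cl.
mass = np.array([12.011, 12.011, 15.999, 35.45])
eneg = np.array([2.55, 2.55, 3.44, 3.16])

print(two_linear(G, mass, eneg))
```

When both vectors are the same property, the two-linear form reduces to a quadratic form; the three- and four-linear variants accept the corresponding three- and four-index coefficient arrays.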

Table 1

Metrics used to compute the “distance” between two atoms of a molecule

Metric	Formula^a	Range^b	Average	Range of average
Minkowski (M1–M7) p = 0.25, 0.5, 1, 1.5, 2, 2.5, 3, and ∞ [where p = 1 is the Manhattan, city-block or taxi distance (also known as Hamming distance between binary vectors) and p = 2 is the Euclidean distance]	$d_{XY} = \left( \sum_{j=1}^{h} \left\| x_{j} - y_{j} \right\|^{p} \right)^{1/p}$	[0, ∞)	$\bar{d} = \frac{d_{XY}}{n^{1/p}}$	[0, ∞)
Chebyshev/Lagrange (M8) (Minkowski formula when p = ∞)	$d_{XY} = \max\left\{ \left\| x_{j} - y_{j} \right\| \right\}$	[0, ∞)		[0, ∞)
Canberra (M10)	$d_{XY} = \sum_{j=1}^{h} \frac{\left\| x_{j} - y_{j} \right\|}{\left\| x_{j} \right\| + \left\| y_{j} \right\|}$	[0, n]	$\bar{d} = \frac{d_{XY}}{n}$	[0, 1]
Lance–Williams/Bray–Curtis (M11)	$d_{XY} = \frac{\sum_{j=1}^{h} \left\| x_{j} - y_{j} \right\|}{\sum_{j=1}^{h} \left( \left\| x_{j} \right\| + \left\| y_{j} \right\| \right)}$	[0, 1]	$\bar{d} = \frac{d_{XY}}{n}$	$\left[0, \frac{1}{n}\right]$
Clark/coefficient of divergence (M12)	$d_{XY} = \sqrt{\sum_{j=1}^{h} \left( \frac{x_{j} - y_{j}}{\left\| x_{j} \right\| + \left\| y_{j} \right\|} \right)^{2}}$	[0, n]	$\bar{d} = \frac{d_{XY}}{\sqrt{n}}$	$\left[0, \sqrt{n}\right]$
Soergel (M13)	$d_{XY} = \frac{1}{n}\sum_{j=1}^{h} \frac{\left\| x_{j} - y_{j} \right\|}{\max\left\{ x_{j}, y_{j} \right\}}$	[0, 1]	$\bar{d} = \frac{d_{XY}}{n}$	$\left[0, \frac{1}{n}\right]$
Bhattacharyya (M14)	$d_{XY} = \sqrt{\sum_{j=1}^{h} \left( \sqrt{x_{j}} - \sqrt{y_{j}} \right)^{2}}$	[0, ∞)	$\bar{d} = \frac{d_{XY}}{\sqrt{n}}$	[0, ∞)
Wave–Edges (M15)	$d_{XY} = \sum_{j=1}^{h} \left( 1 - \frac{\min\left\{ x_{j}, y_{j} \right\}}{\max\left\{ x_{j}, y_{j} \right\}} \right)$	[0, n]	$\bar{d} = \frac{d_{XY}}{n}$	[0, 1]
Angular separation/[1 − Cosine (Ochiai)] (M16)	$d_{XY} = 1 - \cos_{XY}$, where $\cos_{XY} = \frac{\mathbf{X} \cdot \mathbf{Y}}{\left\| \mathbf{X} \right\| \left\| \mathbf{Y} \right\|} = \frac{\sum_{j=1}^{h} x_{j} y_{j}}{\sqrt{\sum_{j=1}^{h} x_{j}^{2} \sum_{j=1}^{h} y_{j}^{2}}}$	[0, 2]		

^a The variables x_j and y_j are the values of coordinate j of the atoms X and Y of a molecule, respectively. The h value is equal to 3 and corresponds to the 3D Cartesian coordinates (x, y, z) of an atom. The p values in the Minkowski metric are 0.25, 0.5, 1 (Manhattan), 1.5, 2 (Euclidean), 2.5 and 3 (Minkowski)

^b “Range” denotes the interval of values (not “rank”) and is defined as Range = max{x} − min{x}
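The pairwise metrics of Table 1 can be sketched in a few lines of Python. The sketch below is only illustrative: the function names and the atomic coordinates are hypothetical, only a subset of the metrics is shown, and the handling of zero denominators is an assumption that the table does not specify.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p (p = 1 Manhattan, p = 2 Euclidean)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def chebyshev(a, b):
    """Limit of the Minkowski distance when p tends to infinity."""
    return float(np.max(np.abs(a - b)))


def canberra(a, b):
    num = np.abs(a - b)
    den = np.abs(a) + np.abs(b)
    # Assumption: a coordinate where both values are zero contributes 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(terms.sum())


def lance_williams(a, b):
    den = np.sum(np.abs(a) + np.abs(b))
    return float(np.sum(np.abs(a - b)) / den) if den > 0 else 0.0


def angular_separation(a, b):
    """1 - cosine; undefined if either atom sits exactly at the origin."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pairwise_matrix(coords, metric):
    """Symmetric two-tuple matrix with a zero diagonal."""
    n = len(coords)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            G[i, j] = G[j, i] = metric(coords[i], coords[j])
    return G


# Hypothetical 3D coordinates of three atoms (not taken from the paper).
coords = np.array([[1.0, 2.0, 2.0], [4.0, 6.0, 2.0], [2.0, 1.0, 5.0]])
print(pairwise_matrix(coords, lambda a, b: minkowski(a, b, 2)))
```

Passing a different metric function to `pairwise_matrix` yields the corresponding two-tuple matrix for the same geometry.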

Table 2

Measures used to compute the ternary (A) and quaternary (B) relations (multi-metrics) among atoms of a molecule

Measure	Formula
(A) Ternary measures ($T_{XYZ}$)
Perimeter (M19–M20)	$T_{XYZ} = d_{XY} + d_{YZ} + d_{ZX}$
Triangle area (M21–M22)	$T_{XYZ} = \sqrt{s \left( s - d_{XY} \right) \left( s - d_{YZ} \right) \left( s - d_{ZX} \right)}$, with $s = \frac{d_{XY} + d_{YZ} + d_{ZX}}{2}$
Sides summation (M25–M26)	$T_{XYZ} = d_{XY} + d_{YZ}$
Bond angle (angle between sides) (M27–M28)	$A_{X}, A_{Y}, A_{Z}$: coordinates of three atoms of a molecule; $U = A_{X} - A_{Y}$, $V = A_{Z} - A_{Y}$; $T_{XYZ} = \alpha = \arccos\left( \frac{U \cdot V}{\left\| U \right\| \left\| V \right\|} \right)$
(B) Quaternary measures ($Q_{XYZW}$)
Perimeter (M19–M20)	$Q_{XYZW} = d_{XY} + d_{YZ} + d_{ZW} + d_{WX}$
Volume (M23–M24)	$A_{X}, A_{Y}, A_{Z}, A_{W}$: coordinates of four atoms of a molecule; $Q_{XYZW} = \frac{1}{6} \det \begin{pmatrix} A_{Y1} - A_{X1} & A_{Z1} - A_{X1} & A_{W1} - A_{X1} \\ A_{Y2} - A_{X2} & A_{Z2} - A_{X2} & A_{W2} - A_{X2} \\ A_{Y3} - A_{X3} & A_{Z3} - A_{X3} & A_{W3} - A_{X3} \end{pmatrix}$
Sides summation (M25–M26)	$Q_{XYZW} = d_{XY} + d_{YZ} + d_{ZW}$
Dihedral angle (M29–M30)	$A_{X}, A_{Y}, A_{Z}$: coordinates of three atoms of a molecule in the plane A; $B_{W}, B_{Y}, B_{Z}$: coordinates of three atoms of a molecule in the plane B; $U_{A} = \left( A_{X} - A_{Y} \right) \times \left( A_{Z} - A_{Y} \right)$; $U_{B} = \left( B_{W} - A_{Y} \right) \times \left( B_{Z} - A_{Y} \right)$; $Q_{XYZW} = \alpha = \arccos\left( \frac{U_{A} \cdot U_{B}}{\left\| U_{A} \right\| \left\| U_{B} \right\|} \right)$
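The ternary and quaternary measures of Table 2 can be sketched as follows. This is a minimal illustration with hypothetical function names; Euclidean side lengths are assumed for the triangle area, the absolute value of the determinant is taken so that the volume is non-negative, and the two planes of the dihedral angle are assumed to share the atoms Y and Z.

```python
import numpy as np


def _angle(u, v):
    """Angle in radians between two vectors, clipped for numerical safety."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def triangle_area(ax, ay, az):
    """Heron's formula on the three (Euclidean) side lengths."""
    dxy = np.linalg.norm(ax - ay)
    dyz = np.linalg.norm(ay - az)
    dzx = np.linalg.norm(az - ax)
    s = (dxy + dyz + dzx) / 2.0
    return float(np.sqrt(max(s * (s - dxy) * (s - dyz) * (s - dzx), 0.0)))


def bond_angle(ax, ay, az):
    """Angle at atom Y between the sides Y-X and Y-Z."""
    return _angle(ax - ay, az - ay)


def tetrahedron_volume(ax, ay, az, aw):
    """One sixth of the (absolute) determinant of the edge vectors from X."""
    edges = np.column_stack([ay - ax, az - ax, aw - ax])
    return float(abs(np.linalg.det(edges)) / 6.0)


def dihedral_angle(ax, ay, az, aw):
    """Angle between the normals of the planes (X, Y, Z) and (W, Y, Z)."""
    u_a = np.cross(ax - ay, az - ay)
    u_b = np.cross(aw - ay, az - ay)
    return _angle(u_a, u_b)


# Hypothetical coordinates: four atoms at the origin and on the three axes.
X, Y = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])
Z, W = np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])
print(bond_angle(X, Y, Z), tetrahedron_volume(X, Y, Z, W), dihedral_angle(X, Y, Z, W))
```

Evaluating a ternary measure over all ordered triples of atoms fills an n × n × n array, which is the shape of the three-tuple matrix shown slice by slice in Table 3.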

Table 3

(A) Chemical structure of Chloro(methoxy)methane and its labeled molecular scaffold, (B) examples of two-tuple total spatial-(dis)similarity matrices for k = 1 (order) calculated from different (dis-)similarity metrics, (C) example of three-tuple total spatial-(dis)similarity matrix for k = 1 (order) calculated from bond angle ternary measure

(A) 3D molecular structure

(B) Two-tuple total spatial-(dis)similarity matrices, $\mathbb{G}^{1}$
$\mathbb{G}^{1}$ based on Euclidean metric					$\mathbb{G}^{1}$ based on Lance-Williams metric
	C1	C2	O3	Cl4	C1	C2	O3	Cl4
C1	0.000	2.408	1.439	3.939	0.000	1.000	0.973	1.000
C2	2.408	0.000	1.438	1.757	1.000	0.000	0.954	0.293
O3	1.439	1.438	0.000	2.598	0.973	0.954	0.000	0.973
Cl4	3.939	1.757	2.598	0.000	1.000	0.293	0.973	0.000
$\mathbb{G}^{1}$ based on Soergel metric					$\mathbb{G}^{1}$ based on Angular Separation metric
	C1	C2	O3	Cl4	C1	C2	O3	Cl4
C1	0.000	1.158	1.003	1.709	0.000	1.354	0.558	1.875
C2	1.158	0.000	1.234	1.359	1.354	0.000	0.318	0.237
O3	1.003	1.234	0.000	2.235	0.558	0.318	0.000	0.952
Cl4	1.709	1.359	2.235	0.000	1.875	0.237	0.952	0.000
(C) Three-tuple total spatial-(dis)similarity matrix, $\mathbb{GT}^{1}$
$\mathbb{GT}^{1}$ slice 1ij					$\mathbb{GT}^{1}$ slice 2ij
	C1	C2	O3	Cl4	C1	C2	O3	Cl4
C1	0.000	0.000	0.000	0.000	0.000	0.000	0.578	0.281
C2	0.000	0.000	0.578	2.470	0.000	0.000	0.000	0.000
O3	0.000	1.985	0.000	2.682	1.985	0.000	0.000	0.697
Cl4	0.000	0.390	0.163	0.000	0.390	0.000	0.553	0.000
$\mathbb{GT}^{1}$ slice 3ij					$\mathbb{GT}^{1}$ slice 4ij
	C1	C2	O3	Cl4	C1	C2	O3	Cl4
C1	0.000	0.578	0.000	0.297	0.000	0.281	0.297	0.000
C2	0.578	0.000	0.000	1.892	2.470	0.000	1.892	0.000
O3	0.000	0.000	0.000	0.000	2.682	0.697	0.000	0.000
Cl4	0.163	0.553	0.000	0.000	0.000	0.000	0.000	0.000

From these total matrix approaches ($\mathbb{G}^{k}$, $\mathbb{GT}^{k}$ and $\mathbb{GQ}^{k}$), local-fragment matrices may be computed in order to consider atom-types or chemical regions of interest, thus yielding the kth two-tuple, three-tuple and four-tuple local-fragment spatial-(dis)similarity matrices, denoted by $\mathbb{G}^{k}_{F}$, $\mathbb{GT}^{k}_{F}$ and $\mathbb{GQ}^{k}_{F}$, respectively (see Eq. 13 in Ref. [25] and Eqs. 17–18 in Ref. [26]). Specifically, the local-fragments (or atom-types), F, that can be taken into account to compute these indices include: hydrogen bond acceptors (A), carbon atoms in aliphatic chains (C), hydrogen bond donors (D), halogens (G), terminal methyl groups (M), carbon atoms in aromatic portion (P) and heteroatoms (X) (see Table 4 for examples).
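The values in Table 4 are consistent with a simple weighting rule, sketched below: an entry is kept when both atoms belong to the fragment, halved when exactly one of them does, and set to zero otherwise. This rule is inferred here from the tabulated numbers only; the authoritative definition is Eq. 13 in Ref. [25], and the function name is hypothetical.

```python
import numpy as np


def local_fragment_matrix(G, in_fragment):
    """Weight G[i, j] by 1, 0.5 or 0 when both, one or none of i, j are in F."""
    mask = np.asarray(in_fragment, dtype=float)
    weight = (mask[:, None] + mask[None, :]) / 2.0
    return G * weight


# Euclidean total matrix of chloro(methoxy)methane (C1, C2, O3, Cl4), Table 4A.
G = np.array([
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
])

halogens = [False, False, False, True]      # Cl4
heteroatoms = [False, False, True, True]    # O3 and Cl4

print(np.round(local_fragment_matrix(G, halogens), 3))
print(np.round(local_fragment_matrix(G, heteroatoms), 3))
```

Up to rounding in the last digit, the two printed matrices agree with the halogen and heteroatom examples of Table 4B.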

Table 4

(A) Two-tuple total spatial-(dis)similarity matrix for k = 1, $\mathbb{G}^{1}$, computed from the 3D coordinates of the molecule Chloro(methoxy)methane (see Table 3A), (B) examples of two-tuple local-fragment spatial-(dis)similarity matrices, $\mathbb{G}_{F}^{1}$, obtained with different chemical fragments

	C1	C2	O3	Cl4
(A) Two-tuple total spatial-(dis)similarity matrix, $\mathbb{G}^{1}$
C1	0.000	2.408	1.439	3.939
C2	2.408	0.000	1.438	1.757
O3	1.439	1.438	0.000	2.598
Cl4	3.939	1.757	2.598	0.000
(B) Two-tuple local-fragment spatial-(dis)similarity matrices, $\mathbb{G}_{F}^{1}$
$\mathbb{G}_{F}^{1}$ based on halogens fragment
C1	0.000	0.000	0.000	1.969
C2	0.000	0.000	0.000	0.878
O3	0.000	0.000	0.000	1.299
Cl4	1.969	0.878	1.299	0.000
$\mathbb{G}_{F}^{1}$ based on methyl groups fragment
C1	0.000	1.204	0.719	1.969
C2	1.204	0.000	0.000	0.000
O3	0.719	0.000	0.000	0.000
Cl4	1.969	0.000	0.000	0.000
$\mathbb{G}_{F}^{1}$ based on heteroatoms fragment
C1	0.000	0.000	0.719	1.969
C2	0.000	0.000	0.719	0.878
O3	0.719	0.719	0.000	2.598
Cl4	1.969	0.878	2.598	0.000

These total (or local-fragment) matrix approaches ($\mathbb{G}^{k}_{(F)}$, $\mathbb{GT}^{k}_{(F)}$ and $\mathbb{GQ}^{k}_{(F)}$) are also known as the kth non-stochastic two-tuple, three-tuple and four-tuple total (or local-fragment) spatial-(dis)similarity matrices, denoted by ${}_{ns}\mathbb{G}^{k}_{(F)}$, ${}_{ns}\mathbb{GT}^{k}_{(F)}$ and ${}_{ns}\mathbb{GQ}^{k}_{(F)}$, respectively, because no normalizing procedure is used in their computation. Nonetheless, in order to obtain normalized matrix representations, three probabilistic schemes may be employed to compute the QuBiLS-MIDAS MDs. In this way, the following normalized matrix representations are obtained from the corresponding non-stochastic matrices: the kth simple-stochastic two-tuple, three-tuple and four-tuple total (or local-fragment) spatial-(dis)similarity matrices [${}_{ss}\mathbb{G}^{k}_{(F)}$, ${}_{ss}\mathbb{GT}^{k}_{(F)}$ and ${}_{ss}\mathbb{GQ}^{k}_{(F)}$] (see Eq. 10 in Ref. [25] and Eqs. 13–14 in Ref. [26]), the kth double-stochastic two-tuple total (or local-fragment) spatial-(dis)similarity matrix [${}_{ds}\mathbb{G}^{k}_{(F)}$] (see the Sinkhorn–Knopp algorithm in Ref. [37]) and the kth mutual-probability two-tuple, three-tuple and four-tuple total (or local-fragment) spatial-(dis)similarity matrices [${}_{mp}\mathbb{G}^{k}_{(F)}$, ${}_{mp}\mathbb{GT}^{k}_{(F)}$ and ${}_{mp}\mathbb{GQ}^{k}_{(F)}$] (see Eq. 12 in Ref. [25] and Eqs. 15–16 in Ref. [26]). Table 5 shows the results obtained with the three probabilistic transformations on a two-tuple total spatial-(dis)similarity matrix.
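The three probabilistic schemes can be sketched for the two-tuple case as follows. The normalisations used here (division by row sums, alternating row and column scaling, and division by the grand total) are the readings that reproduce the values of Table 5; the exact definitions are those of the cited equations in Refs. [25, 26] and of the Sinkhorn–Knopp algorithm in Ref. [37]. The function names, tolerance and iteration limit are assumptions.

```python
import numpy as np


def simple_stochastic(G):
    """Divide every row by its sum, so that each row adds up to 1."""
    row_sums = G.sum(axis=1, keepdims=True)
    return np.divide(G, row_sums, out=np.zeros_like(G), where=row_sums > 0)


def double_stochastic(G, tol=1e-10, max_iter=10_000):
    """Sinkhorn-Knopp iteration: alternately normalise rows and columns."""
    M = G.astype(float).copy()
    for _ in range(max_iter):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
        if np.allclose(M.sum(axis=1), 1.0, rtol=0, atol=tol):
            break
    return M


def mutual_probability(G):
    """Divide every entry by the sum of all entries."""
    total = G.sum()
    return G / total if total > 0 else np.zeros_like(G)


# Non-stochastic Euclidean matrix of chloro(methoxy)methane (C1, C2, O3, Cl4).
G = np.array([
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
])

for transform in (simple_stochastic, double_stochastic, mutual_probability):
    print(transform.__name__)
    print(np.round(transform(G), 3))
```

Note that the simple-stochastic matrix is in general not symmetric, whereas the double-stochastic and mutual-probability matrices of a symmetric input remain symmetric.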

Table 5

	C1	C2	O3	Cl4	C1	C2	O3	Cl4
	Non-stochastic matrix, ${}_{ns}\mathbb{G}^{1}$				Simple-stochastic matrix, ${}_{ss}\mathbb{G}^{1}$
C1	0.000	2.408	1.439	3.939	0.000	0.309	0.185	0.506
C2	2.408	0.000	1.438	1.757	0.430	0.000	0.257	0.314
O3	1.439	1.438	0.000	2.598	0.263	0.263	0.000	0.475
Cl4	3.939	1.757	2.598	0.000	0.475	0.212	0.313	0.000
	Double-stochastic matrix, ${}_{ds}\mathbb{G}^{1}$				Mutual probability matrix, ${}_{mp}\mathbb{G}^{1}$
C1	0.000	0.387	0.246	0.368	0.000	0.089	0.053	0.145
C2	0.387	0.000	0.368	0.246	0.089	0.000	0.053	0.065
O3	0.246	0.368	0.000	0.387	0.053	0.053	0.000	0.096
Cl4	0.368	0.246	0.387	0.000	0.145	0.065	0.096	0.000

Example of probabilistic transformations on the non-stochastic two-tuple total spatial-(dis)similarity matrix for k = 1, ${}_{ns}\mathbb{G}^{1}$, computed from the 3D coordinates of the chloro(methoxy)methane compound (see Table 1A) by using the Euclidean metric

Finally, from the non-stochastic (simple-stochastic, double-stochastic or mutual probability) total (or local-fragment) two-tuple, three-tuple and four-tuple matrices, the corresponding atom-level matrices are calculated and their coefficients are used in the calculation of the descriptors (see Eqs. 1–3). Each atom-level matrix determines an atom-level index for atom “a” of a molecule, and this value constitutes a component (entry) of the vector L. Once the vector L is computed, the global definition of the kth two-linear, three-linear and four-linear algebraic indices is obtained by applying one or several aggregation operators to the entries of L (see Additional file 1: Table S2 for the mathematical definitions) [25, 26]; these operators have been successfully employed in other reports [38-40]. Scheme 1 shows a general flowchart of the calculation process of the QuBiLS-MIDAS MDs detailed in this section, while Scheme 2 is a graphic representation of each step employed in the computation of a specific two-linear algebraic index.
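The splitting and aggregation steps can be sketched on the chloro(methoxy)methane example. This is a hedged illustration under three stated assumptions, none of which is taken from Eqs. 1–3: (i) a local matrix weights each entry by (w_i + w_j)/2, where w = 1 for atoms in the fragment, a rule inferred from the values in Table 4B; (ii) an atom-level matrix is the local matrix of a fragment made of a single atom; and (iii) the atom-level index is the bilinear form x^T G^a y with both vectors equal to the atomic masses. The names `local_matrix` and `bilinear` are hypothetical.

```python
# Total non-stochastic two-tuple matrix for k = 1, copied from Table 5
# (atom order: C1, C2, O3, Cl4).
G = [
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
]
N = len(G)


def local_matrix(g, members):
    """Weight g[i][j] by (w_i + w_j) / 2, with w = 1 for atoms in `members`.

    ASSUMPTION: this half-weight rule is inferred from Table 4B.
    """
    n = len(g)
    w = [1.0 if i in members else 0.0 for i in range(n)]
    return [[g[i][j] * (w[i] + w[j]) / 2 for j in range(n)] for i in range(n)]


def bilinear(x, m, y):
    """Bilinear form x^T m y (ASSUMED form of the atom-level index)."""
    return sum(x[i] * m[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


# Illustrative property vector: standard atomic masses of C, C, O and Cl.
mass = [12.011, 12.011, 15.999, 35.45]

# One atom-level matrix per atom (fragment made of that single atom).
atom_level = [local_matrix(G, {a}) for a in range(N)]

# Vector L of atom-level indices, followed by the Manhattan aggregation
# (sum of the absolute values of the entries).
L = [bilinear(mass, g_a, mass) for g_a in atom_level]
descriptor = sum(abs(v) for v in L)

print([round(v, 3) for v in L])
print(round(descriptor, 3))
```

With this weighting the atom-level matrices add up to the total matrix, so for non-negative entries the Manhattan aggregation of the atom-level indices equals the bilinear form evaluated on the total matrix; other aggregation operators would weight the atomic contributions differently.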

Scheme 1

General workflow for calculating the QuBiLS-MIDAS molecular descriptors. (1) Computation of the molecular vectors according to the selected atomic properties; (2) Computation of the non-stochastic two-tuple, three-tuple or four-tuple total spatial-(dis)similarity matrices for k = 1 from the 3D Cartesian coordinates of each atom of a molecule; (3) Consideration of atom-types or local-fragments (optional); (4) Computation of the simple-stochastic, double-stochastic and mutual probability matrices, as well as of the kth matrices through the Hadamard product up to the selected k value; (5) Splitting of the calculated matrices into atom-level matrices; (6) Computation of the atom-level indices (descriptors) using the molecular vectors calculated in step (1); and (7) Application of the selected aggregation operators to the vector of atom-level descriptors

Scheme 2

General workflow for the calculation of a two-linear descriptor based on the linear algebraic form, Euclidean metric, non-stochastic matrix approach, atomic mass as property and Manhattan aggregation operator. (1) Computation of the non-stochastic matrix for k = 1 from the 3D coordinates matrix using the Euclidean metric; (2) Computation of the molecular vector based on the atomic mass property; (3) Splitting of the matrix into “n” (number of atoms) atom-level matrices, one for each atom “a” of the molecule; (4) Computation of the atom-level descriptors and storage of them in a vector; and (5) Application of the Manhattan aggregation operator to the entries of this vector, the resulting value being the molecular descriptor

In order to automate the calculation of the 3D N-linear algebraic indices used in the present manuscript, the QuBiLS-MIDAS software has been developed [41]. One of its main features is the multi-core processing of the MDs, together with the option of carrying out the distributed calculation of the indices by using the Multi-Server Distributed Computing Platform known as T-arenal [42].
The latter is particularly useful for high-throughput calculation tasks. Both programs are freely available online at http://tomocomd.com/.

Methods

In order to assess the correlation ability of the QuBiLS-MIDAS MDs for different biological activities, eight well-known chemical datasets were used. These were previously employed by Sutherland et al. in a comparative study of QSAR methods commonly used in chemoinformatics analysis [23] and have since been utilized as “benchmarks” for comparing the results obtained with other approaches [43-47]. The datasets comprise angiotensin converting enzyme (ACE) inhibitors, acetylcholinesterase (AchE) inhibitors, ligands for the benzodiazepine receptor (BZR), cyclooxygenase-2 (COX2) inhibitors, dihydrofolate reductase inhibitors (DHFR), inhibitors of glycogen phosphorylase b (GPB), thermolysin inhibitors (THER) and thrombin inhibitors (THR). In this study the 3D coordinates were generated using the CORINA software, and the same partitioning into training and test sets used in the initial study was retained in order to guarantee comparability of results. For these datasets, several configurations based on 3D two-linear, three-linear and four-linear algebraic indices were computed (see Additional file 1: Table S3) using the QuBiLS-MIDAS software [41]. Because numerous MDs are computed with this program, yielding a high-dimensional space, strategies for data reduction are necessary. Accordingly, the following workflow was applied to each set of indices calculated for each chemical dataset, considering only the training set compounds. (1) The 1000 MDs with the best variability according to their Shannon’s entropy values [29] were retained by using the IMMAN software [30]. (2) The MDs with values expressed as powers of 10 (scientific notation) whose exponents are greater than +5 or lower than −5 were removed. (3) Filters were applied to remove the MDs with a correlation equal to or greater than 0.95 and a standardized entropy lower than 0.3. (4) Multiple Linear Regression (MLR), as implemented in the STATISTICA software, was employed to select the MDs included in the model by using the Forward Stepwise and Backward Stepwise selection procedures. (5) The MDs retained after applying the previous steps and computed for the same compounds were merged into a single dataset.
With the reduced data matrix for each chemical dataset, QSAR models were built with the MLR technique to determine the relationship between the response (activity) and the predictor variables (MDs). The MLR technique was coupled with the Genetic Algorithm (GA) meta-heuristic as the variable selection method [48]. This strategy (MLR + GA) is implemented in the MobyDigs software (version 1.0), which was used to carry out this study [49]. To perform the search, several populations of 100 3D N-linear MDs each were created, and the following configuration was used for the GA procedure: (1) number of iterations equal to 500,000; (2) population size equal to 100; (3) reproduction/mutation trade-off equal to 0.5; and (4) selection bias initially set to 0 (random selection) until 80 % of the maximum number of iterations was reached, and then set to 1 (tournament selection) in order to increase the selection pressure. These parameter values were selected according to the study performed by Todeschini et al. [49]. The search was carried out using the “leave-one-out” cross-validation parameter $Q_{\text{loo}}^{2}$ as the fitness function. Once the exploration of each population was completed, the MDs included in the resulting 9-variable models were retained in order to create new populations of up to 100 MDs. This process was repeated until a single population with at most 100 MDs was obtained. Finally, from the final population and for each compound dataset, 3–9 variable regression models were built for the corresponding biological activity.
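The first three data-reduction filters described above (entropy ranking, magnitude filter, and correlation/entropy thresholds) can be sketched as follows. This is only an illustration and not the IMMAN procedure: the equal-width binning used to estimate Shannon's entropy, the definition of standardized entropy as H / log2(bins), and the greedy pairwise pruning of correlated descriptors are all assumptions, and every function name is hypothetical.

```python
import math


def shannon_entropy(values, bins=10):
    """Shannon entropy (bits) of a descriptor after equal-width binning (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def extreme_exponent(values, limit=5):
    """True if any non-zero value has a base-10 exponent above +limit or below -limit."""
    return any(v != 0 and abs(math.floor(math.log10(abs(v)))) > limit for v in values)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def reduce_descriptors(table, top=1000, bins=10, min_std_entropy=0.3, max_corr=0.95):
    """table maps a descriptor name to its list of training-set values."""
    h = {name: shannon_entropy(col, bins) for name, col in table.items()}
    # Step 1: keep the `top` descriptors with the highest entropy.
    ranked = sorted(table, key=h.get, reverse=True)[:top]
    # Step 2: drop descriptors containing values with extreme exponents.
    ranked = [n for n in ranked if not extreme_exponent(table[n])]
    # Step 3a: drop descriptors whose standardized entropy is below the threshold.
    ranked = [n for n in ranked if h[n] / math.log2(bins) >= min_std_entropy]
    # Step 3b: greedily drop descriptors correlated (|r| >= max_corr) with a kept one.
    kept = []
    for n in ranked:
        if all(abs(pearson(table[n], table[k])) < max_corr for k in kept):
            kept.append(n)
    return kept


# Synthetic toy data (not taken from the study).
demo = {
    "d1": [0.1, 0.5, 0.9, 1.3, 1.7, 2.1],
    "d2": [0.2, 1.0, 1.8, 2.6, 3.4, 4.2],  # perfectly correlated with d1
    "d3": [1.0, 1.0, 1.0, 1.0, 1.0, 5.0],  # low variability
    "d4": [3e7, 1.0, 2.0, 3.0, 4.0, 5.0],  # exponent above +5
    "d5": [2.0, 0.3, 1.9, 0.1, 1.2, 0.8],
}
kept = reduce_descriptors(demo)
print(kept)
```

In this toy table the correlated copy, the nearly constant descriptor and the descriptor with a very large value are discarded, leaving two descriptors for the subsequent stepwise and GA-based selection.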
However, as the MobyDigs software generates a set of MLR models, the most suitable model was chosen according to the following steps. (1) The “best” 50 models according to the $Q_{\text{loo}}^{2}$ parameter were retained. (2) The validation methods “bootstrapping” [50] and “Y-scrambling” [51] (a(Q²)) were applied to each retained model in order to assess its predictive power and the possible chance correlation with respect to the modeled biological activity, respectively. The former randomly creates training sets (with repeated objects) of the same size as the original, with the objects left out constituting the test set, while the latter randomly permutes the true response values to determine the quality of the model. Bootstrapping and Y-scrambling were repeated 5000 and 300 times, respectively. These methods were applied because the leave-one-out procedure alone does not suffice to validate the stability of a predictive model [52]. (3) For each model the function f(x) was computed, which takes into account the results obtained with the two validation procedures employed; the model with the smallest f(x) value constitutes the “best” regression model. (4) The “best” regression model was assessed by “external validation” on the corresponding test set in order to measure its generalization ability.
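The leave-one-out and Y-scrambling statistics can be sketched in a few lines. This is a minimal illustration on synthetic data and not the MobyDigs implementation. It assumes that the leave-one-out coefficient is 1 − PRESS/TSS with the total sum of squares taken about the full training-set mean, and that a(Q²) is the intercept of the least-squares line relating the Q² of each scrambled model to the absolute correlation between the scrambled and the true responses. Bootstrapping and the function f(x) are not shown, and all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    ztz = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    zty = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(ztz, zty)


def q2_loo(X, y):
    """1 - PRESS / TSS, refitting the model with each compound left out."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = coef[0] + sum(c * v for c, v in zip(coef[1:], X[i]))
        press += (y[i] - pred) ** 2
    return 1.0 - press / sum((t - mean_y) ** 2 for t in y)


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5


def y_scrambling_intercept(X, y, runs=300, seed=0):
    """Intercept of the line Q2 = a + b * |r(y, y_scrambled)| (assumed a(Q2))."""
    rng = random.Random(seed)
    pts = []
    for _ in range(runs):
        ys = y[:]
        rng.shuffle(ys)
        pts.append((abs(pearson(y, ys)), q2_loo(X, ys)))
    mr = sum(r for r, _ in pts) / runs
    mq = sum(q for _, q in pts) / runs
    slope = sum((r - mr) * (q - mq) for r, q in pts) / sum((r - mr) ** 2 for r, _ in pts)
    return mq - slope * mr


# Synthetic data: 12 "compounds", 2 descriptors, small fixed noise.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5], [4.0, 3.0], [5.0, 0.0],
     [6.0, 2.5], [7.0, 1.0], [8.0, 3.5], [9.0, 2.0], [10.0, 0.5], [11.0, 3.0]]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1, 0.0, 0.05]
y = [1.0 + 2.0 * a - b + e for (a, b), e in zip(X, noise)]

q2 = q2_loo(X, y)
a_q2 = y_scrambling_intercept(X, y)
print(round(q2, 4), round(a_q2, 4))
```

A model that captures a real relationship is expected to show a high leave-one-out coefficient together with a low scrambling intercept, which is the pattern the selection steps above look for.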

Results and discussion

Assessment of the QuBiLS-MIDAS models versus other approaches

In this section the performance of the QuBiLS-MIDAS models for the chemical datasets described in the “Methods” section is compared with that of 16 QSAR methodologies (or descriptor sets) reported in the literature. Table 6 shows the statistical parameters and equations of the best regression model based on total and local-fragment QuBiLS-MIDAS indices for each chemical dataset used in this report. In general, the bootstrapping validation coefficient ($Q_{\text{boot}}^{2}$) calculated for each model is greater than 0.6, indicative of the good predictive power of the built models. Also, the coefficients computed from the scrambling tests (a(Q²)) are in all cases below 0.4, indicating a reduced propensity for chance correlation. Lastly, the values achieved in the external prediction ($Q_{\text{ext}}^{2}$) suggest that the models based on QuBiLS-MIDAS MDs have appropriate generalization ability, given that all of them account for more than 49 % of the total variance even when outlier compounds are retained in the validation set.

Table 6

Statistical parameters and equations of the best models developed for each chemical dataset analyzed

Size	R²	$Q_{\text{loo}}^{2}$	$Q_{\text{boot}}^{2}$	a(Q²)	$Q_{\text{ext}}^{2}$	SDEP_ext	Models^a
ACE dataset
6	0.814	0.7756	0.765	−0.169	0.7422	1.078	Act = 1.576 (±1.283) + 0.132 (±0.018) ${}_{NS2}^{SD}TrC_{e}^{M20(M4)}$ − 17.977 (±3.649) ${}_{SS2}^{RA}B_{a-c}^{M1}$ + 2.135 (±0.398) ${}_{SS0}^{RA}B_{a-e}$ − 3.900 (±0.772) ${}_{SS1}^{RA}F_{a}^{M1}$ + 0.034 (±0.013) $[{}_{NS3}^{AC[3]\_K}TrC_{c}^{M20(M16)}]^{D}$ − 0.114 (±0.071) $[{}_{MP1}^{RA}QuQd_{e}^{M29}]^{X}$
ACHE dataset
8	0.738	0.6574	0.626	−0.213	0.6309	0.784	Act = 7.622 (±0.564) − 0.010 (±0.004) ${}_{SS4}^{i50}TrQB_{e-v}^{M21(M3)}$ − 0.204 (±0.046) ${}_{NS4}^{K}Tr_{a-e-h}^{M21(M1)}$ + 3.311 (±0.673) ${}_{SS1}^{i50}B_{a-h}^{M1}$ − 111.324 (±30.793) ${}_{MP2}^{i50}F_{a}^{M1}$ − 0.413 (±0.156) ${}_{SS7}^{ES\_SD}TrB_{a-e}^{M21(M13)}$ − 0.647 (±0.201) ${}_{NS4}^{TS[2]\_K}B_{a-v}^{M4}$ + 0.022 (±0.011) $[{}_{NS4}^{K}Tr_{a-e-h}^{M21(M1)}]^{A}$ − 1.747 (±0.699) $[{}_{SS1}^{i50}B_{a-h}^{M1}]^{P}$
BZR dataset
9	0.754	0.6931	0.669	−0.170	0.5692	0.631	Act = 8.589 (±0.592) + 0.160 (±0.024) ${}_{SS7}^{TS[4]\_K}Tr_{a-e-h}^{M19(M11)}$ + 0.416 (±0.076) ${}_{SS1}^{RA}B_{c-v}^{M2}$ + 0.018 (±0.006) ${}_{SS2}^{i50}TrB_{e-v}^{M19(M16)}$ + 0.092 (±0.034) ${}_{NS2}^{TS[7]\_K}Tr_{a-h-c}^{M27}$ + 0.030 (±0.010) ${}_{NS2}^{AC[1]\_K}B_{c-e}^{M2}$ − 7.940 (±2.981) ${}_{SS0}^{TS[4]\_i50}B_{a-c}$ − 0.009 (±0.005) $[{}_{SS4}^{AC[4]\_K}TrB_{e-v}^{M20(M13)}]^{D}$ + 0. (±0.) $[{}_{NS4}^{AM}QuQd_{v}^{M26(M8)}]^{C}$ + 0. (±0.) $[{}_{NS4}^{AM}QuQd_{v}^{M26(M8)}]^{P}$
COX2 dataset
9	0.670	0.6313	0.615	−0.091	0.4932	1.038	Act = −94.390 (±8.607) + 1.759 (±0.150) ${}_{MP3}^{ES\_N1}B_{v-e}^{M3}$ − 0.032 (±0.007) ${}_{NS4}^{AC[1]\_K}B_{a-e}^{M13}$ + 0.317 (±0.070) ${}_{SS0}^{ES\_i50}B_{h-e}$ + 0.005 (±0.002) ${}_{SS2}^{SD}TrQB_{v-h}^{M20(M16)}$ + 0.021 (±0.005) ${}_{NS4}^{TS[5]\_K}B_{a-c}^{M11}$ + 0.081 (±0.017) ${}_{NS2}^{AC[1]\_K}B_{c-e}^{M8}$ − 17.442 (±3.695) $[{}_{SS4}^{SD}QuCB_{h-c}^{M26(M8)}]^{D}$ − 14.761 (±2.510) $[{}_{SS4}^{SD}QuCB_{h-c}^{M26(M8)}]^{M}$ + 122.311 (±50.893) $[{}_{MP1}^{SD}Tr_{a-h-c}^{M20(M16)}]^{X}$
DHFR dataset
9	0.732	0.7055	0.697	−0.077	0.6405	0.826	Act = 3.127 (±0.519) + 0.019 (±0.005) ${}_{SS1}^{RA}TrB_{e-v}^{M21(M2)}$ + 0.050 (±0.007) ${}_{NS6}^{GV[4]\_K}B_{c-e}^{M4}$ − 15.592 (±3.530) ${}_{MP4}^{TS[2]\_i50}QuQd_{m}^{M25(M3)}$ − 0.067 (±0.007) ${}_{NS2}^{GV[3]\_K}B_{a-c}^{M1}$ + 0.471 (±0.034) ${}_{NS3}^{GV[1]\_K}B_{h-c}^{M3}$ − 0.325 (±0.037) ${}_{NS1}^{TS[4]\_N1}B_{c-e}^{M1}$ + 55.107 (±10.603) ${}_{NS1}^{GV[5]\_SD}B_{c-e}^{M3}$ + 0.044 (±0.008) ${}_{NS2}^{TS[3]\_SD}B_{v-e}^{M4}$ − 0.933 (±0.331) ${}_{MP4}^{N1}Qu_{e-v-h-c}^{M26(M3)}$
GPB dataset
8	0.893	0.8124	0.774	−0.394	0.8283	0.499	Act = 2.073 (±0.351) + 0.334 (±0.078) ${}_{SS3}^{TS[4]\_K}TrB_{e-h}^{M20(M8)}$ + 0.147 (±0.051) ${}_{NS2}^{AC[3]\_K}F_{e}^{M8}$ + 0.046 (±0.009) ${}_{SS3}^{AC[4]\_N1}B_{c-v}^{M12}$ + 55.958 (±10.078) ${}_{SS2}^{AC[2]\_N1}B_{a-c}^{M8}$ + 0.050 (±0.039) ${}_{SS4}^{N1}Tr_{e-v-c}^{M19(M12)}$ + 0.078 (±0.055) ${}_{NS3}^{GV[2]\_K}F_{a}^{M11}$ + 1.322 (±0.427) ${}_{MP0}^{SD}QuQTr_{e-v-h}$ − 0.309 (±0.108) ${}_{MP4}^{SD}QuQTr_{e-v-h}^{M26(M3)}$
THER dataset
7	0.815	0.7530	0.723	−0.260	0.7248	1.197	Act = −11.296 (±3.486) + 126.508 (±41.628) ${}_{\mathbf{NS1}}^{\mathbf{GV[5]\_N1}}\mathbf{B}_{\mathbf{a-c}}^{\mathbf{M8}}$ + 0.016 (±0.003) ${}_{\mathbf{NS1}}^{\mathbf{GV[7]\_i50}}\mathbf{Q}_{\mathbf{e}}^{\mathbf{M8}}$ − 4.265 (±0.851) ${}_{\mathbf{SS1}}^{\mathbf{N1}}\mathbf{Tr}_{\mathbf{v-h-c}}^{\mathbf{M20(M3)}}$ + 0.718 (±0.171) ${}_{\mathbf{SS3}}^{\mathbf{RA}}\mathbf{TrC}_{\mathbf{e}}^{\mathbf{M20(M3)}}$ + 0.016 (±0.009) ${}_{\mathbf{SS4}}^{\mathbf{RA}}\mathbf{TrB}_{\mathbf{e-v}}^{\mathbf{M27}}$ − 0.027 (±0.029) $\left[{}_{\mathbf{SS4}}^{\mathbf{RA}}\mathbf{TrB}_{\mathbf{e-v}}^{\mathbf{M27}}\right]^{A}$ + 0.042 (±0.027) $\left[{}_{\mathbf{SS4}}^{\mathbf{RA}}\mathbf{TrB}_{\mathbf{e-v}}^{\mathbf{M27}}\right]^{X}$
THR dataset
9	0.866	0.8149	0.789	−0.286	0.7674	0.540	Act = 5.251 (±0.605) − 2120.900 (±253.086) ${}_{\mathbf{MP2}}^{\mathbf{TS[1]\_i50}}\mathbf{Tr}_{\mathbf{a-h-c}}^{\mathbf{M19(M2)}}$ − 0.0001 (±0.) ${}_{\mathbf{NS0}}^{\mathbf{TS[5]\_i50}}\mathbf{Tr}_{\mathbf{e-v-h}}$ + 0.060 (±0.013) ${}_{\mathbf{SS1}}^{\mathbf{AC[2]\_K}}\mathbf{TrQB}_{\mathbf{a-c}}^{\mathbf{M27}}$ + 0.022 (±0.004) ${}_{\mathbf{NS3}}^{\mathbf{RA}}\mathbf{Tr}_{\mathbf{e-v-h}}^{\mathbf{M20(M2)}}$ + 1.415 (±0.222) ${}_{\mathbf{NS2}}^{\mathbf{RA}}\mathbf{TrQB}_{\mathbf{a-c}}^{\mathbf{M20(M8)}}$ + 0.958 (±0.293) ${}_{\mathbf{NS2}}^{\mathbf{GV[4]\_PN}}\mathbf{B}_{\mathbf{c-v}}^{\mathbf{M8}}$ + 0.107 (±0.041) ${}_{\mathbf{SS4}}^{\mathbf{K}}\mathbf{Tr}_{\mathbf{e-v-h}}^{\mathbf{M21(M8)}}$ + 0.029 (±0.012) ${}_{\mathbf{MP4}}^{\mathbf{AC[7]\_K}}\mathbf{Tr}_{\mathbf{a-e-c}}^{\mathbf{M19(M13)}}$ − 0.058 (±0.022) $\left[{}_{\mathbf{SS1}}^{\mathbf{AC[2]\_K}}\mathbf{TrQB}_{\mathbf{a-c}}^{\mathbf{M27}}\right]^{\mathbf{C}}$

^a See Additional file 1: Table S7 for nomenclature of the QuBiLS-MIDAS descriptors

On the other hand, Tables 7 and 8 show the comparisons with respect to other approaches reported in the literature, as well as the results obtained by the models based exclusively on total QuBiLS-MIDAS MDs (see Additional file 1: Table S4 for information on the best models from 3 to 9 variables). In this manner, the importance of considering local-fragments (atom-types or groups) in the calculation of the QuBiLS-MIDAS MDs, and subsequently in the building of QSAR models, can be analyzed. As can be observed in both tables, the QuBiLS-MIDAS models perform better when local-fragments are considered than when they are not. In particular, in 6 of the 8 datasets studied the $Q^2_{loo}$ parameter is rather comparable, while better performances are attained according to $Q^2_{ext}$. Both parameters show their largest improvements for the COX2 dataset, where the external prediction explains more than 49 % of the total variance, a threshold that no other QSAR procedure surpasses. On the other hand, only in the DHFR and GPB datasets does the use of the local-fragment QuBiLS-MIDAS MDs not influence the performance of the developed QSAR models. It can thus be stated that considering a mixture of total and local-fragment QuBiLS-MIDAS MDs in the building of QSAR models contributes to the improvement of the predictive ability.

Table 7

Comparison of the cross-validation statistic parameter obtained from the QuBiLS-MIDAS models with respect to the performance achieved by 15 QSAR procedures

	ACE	ACHE	BZR	COX2	DHFR	GPB	THER	THR
QuBiLS-MIDAS^a	0.7756	0.6574	0.6931	0.6313	0.7055	0.8124	0.7530	0.8149
QuBiLS-MIDAS^b	0.7713	0.6521	0.6886	0.6064	0.7055	0.8124	0.7495	0.8047
CoMFA [23]	0.68	0.52	0.32	0.49	0.65	0.42	0.52	0.59
COMSIA basic [23]	0.65	0.48	0.41	0.43	0.63	0.43	0.54	0.62
COMSIA extra [23]	0.66	0.49	0.45	0.57	0.65	0.61	0.51	0.72
EVA [23]	0.70	0.42	0.40	0.45	0.64	0.58	0.48	0.47
HQSAR [23]	0.72	0.34	0.42	0.50	0.69	0.66	0.49	0.50
2D [23]	0.68	0.32	0.36	0.49	0.51	0.31	0.62	0.62
2.5D [23]	0.72	0.31	0.35	0.55	0.53	0.46	0.66	0.52
SAMFA-RF [43]	0.69	0.58	0.43	0.38	0.70	0.66	0.52	0.53
SAMFA-SVM [43]	0.52	0.29	0.38	0.39	0.57	0.53	0.18	0.39
SAMFA-PLS [43]	0.65	0.54	0.49	0.40	0.68	0.61	0.60	0.56
Fingerprints Library [44]	0.69	0.57	0.56	0.55	0.76	0.53	0.53	0.58
O3Q [45]	0.69	0.52	0.42	0.48	0.70	0.55	0.48	0.59
O3QMFA [46]	0.65	0.41	0.41	0.43	0.69	0.30	0.47	0.65
O3A/O3Q [45]	0.71	0.55	0.46	0.46	0.66	0.50	0.67	0.68
COSMOsar3D [46]	0.71	0.53	0.45	0.54	0.69	0.61	0.58	0.74

a values corresponding to the best model reported considering total and local-fragment QuBiLS-MIDAS indices (see Table 6)

b values corresponding to the best model reported considering only total QuBiLS-MIDAS indices (see Additional file 1: Table S4)

Italic values correspond to the best results reported in the literature and those obtained by the QuBiLS-MIDAS 3D-MDs

Table 8

Comparison of the external predictive accuracy attained by the QuBiLS-MIDAS models with respect to the generalization ability achieved with 12 QSAR procedures

	ACE	ACHE	BZR	COX2	DHFR	GPB	THER	THR
QuBiLS-MIDAS^a	0.7422	0.6309	0.5692	0.4932	0.6405	0.8283	0.7248	0.7674
QuBiLS-MIDAS^b	0.7255	0.5989	0.5459	0.4660	0.6405	0.8283	0.7061	0.7498
CoMFA [23]	0.49	0.47	0.00	0.29	0.59	0.42	0.54	0.63
COMSIA basic [23]	0.52	0.44	0.08	0.03	0.52	0.46	0.36	0.55
COMSIA extra [23]	0.49	0.44	0.12	0.37	0.53	0.59	0.53	0.63
EVA [23]	0.36	0.28	0.16	0.17	0.57	0.49	0.36	0.11
HQSAR [23]	0.30	0.37	0.17	0.27	0.63	0.58	0.53	−0.25
2D [23]	0.47	0.16	0.14	0.25	0.47	−0.06	0.14	0.04
2.5D [23]	0.51	0.16	0.20	0.27	0.49	0.04	0.07	0.28
O3Q [45]	0.69	0.67	0.17	0.32	0.60	0.50	0.51	0.67
O3QMFA [46]	0.45	0.61	0.13	0.37	0.59	0.29	0.49	0.60
O3A/O3Q [45]	0.54	0.65	0.24	0.28	0.53	0.41	−0.18	0.30
COSMOsar3D [46]	0.62	0.61	0.13	0.43	0.58	0.63	0.59	0.66
2D-FPT [47]	0.713^L	0.714^N	0.378^L	0.329^N	0.683^N	0.667^L	0.649^L	0.737^N

a values corresponding to the best model reported considering total and local-fragment QuBiLS-MIDAS indices (see Table 6)

b values corresponding to the best model reported considering only total QuBiLS-MIDAS indices (see Additional file 1: Table S4)

^L 2D-FPT-based linear models

^N 2D-FPT-based non-linear models

Italic values correspond to the best results reported in the literature and those obtained by the QuBiLS-MIDAS 3D-MDs

Also, it can be observed from Table 7 that the cross-validation performances achieved by the QuBiLS-MIDAS models are comparable-to-superior with respect to the approaches reported in the literature. Until now, the best $Q^2_{loo}$ values for the datasets ACE, ACHE, BZR, COX2, GPB, THER and THR had been attained by the procedures HQSAR (and 2.5D) [$Q^2_{loo}$ = 0.72], SAMFA-RF ($Q^2_{loo}$ = 0.58), All-Shortest Path [ASP] Fingerprint ($Q^2_{loo}$ = 0.56), COMSIA extra ($Q^2_{loo}$ = 0.57), HQSAR (and SAMFA-RF) [$Q^2_{loo}$ = 0.66], O3A/O3Q ($Q^2_{loo}$ = 0.67) and COMSIA extra ($Q^2_{loo}$ = 0.72), respectively, by using PLS, Random Forest (RF) or Support Vector Machine (SVM) techniques. However, all these previous results are clearly outperformed by the QuBiLS-MIDAS models [(ACE, $Q^2_{loo}$ = 0.7756), (ACHE, $Q^2_{loo}$ = 0.6574), (BZR, $Q^2_{loo}$ = 0.6931), (COX2, $Q^2_{loo}$ = 0.6313), (GPB, $Q^2_{loo}$ = 0.8124), (THER, $Q^2_{loo}$ = 0.7530) and (THR, $Q^2_{loo}$ = 0.8149)], which were built with MLR, a simpler method than those employed in the reported results. In the specific case of the DHFR dataset, although the value attained with the QuBiLS-MIDAS approach ($Q^2_{loo}$ = 0.7055) is not better than the current best result (ASP fingerprint, $Q^2_{loo}$ = 0.76), it is superior to those of the remaining QSAR procedures. However, it is important to remark that no external prediction value ($Q^2_{ext}$) is reported for the best DHFR model (ASP fingerprint + SVM), and thus the corresponding $Q^2_{loo}$ could be overoptimistic.

Regarding the external predictions, it can be observed in Table 8 that the models based on QuBiLS-MIDAS indices yield comparable-to-superior performances with respect to the results reported in the literature. Specifically, the models for the ACE ($Q^2_{ext}$ = 0.7422), BZR ($Q^2_{ext}$ = 0.5692), COX2 ($Q^2_{ext}$ = 0.4932), GPB ($Q^2_{ext}$ = 0.8283), THER ($Q^2_{ext}$ = 0.7248) and THR ($Q^2_{ext}$ = 0.7674) test sets outperform the best results reported to date for each of these datasets, which correspond to COSMOsar3D ($Q^2_{ext}$ = 0.43) in COX2 and to the 2D-FPT methodology in the other datasets [(ACE, $Q^2_{ext}$ = 0.713), (BZR, $Q^2_{ext}$ = 0.378), (GPB, $Q^2_{ext}$ = 0.667), (THER, $Q^2_{ext}$ = 0.649) and (THR, $Q^2_{ext}$ = 0.737)]. The 2D-FPT models were developed using the SQS framework, which determines linear and non-linear models (see Table 8), while the model corresponding to COSMOsar3D is based on the PLS technique. Even so, the obtained MLR models have better predictive accuracy than these similar or more complex procedures.

As for the ACHE and DHFR datasets, the predictive power obtained for the models built with the QuBiLS-MIDAS approach is inferior to the best results reported so far in the literature. In the former dataset, the methods 2D-FPT ($Q^2_{ext}$ = 0.714), O3Q ($Q^2_{ext}$ = 0.67) and O3A/O3Q ($Q^2_{ext}$ = 0.65) offer better predictions than the proposed model ($Q^2_{ext}$ = 0.6309), although the latter can still be considered suitable (it explains 63 % of the total variance). Additionally, for the DHFR test set the 2D-FPT approach ($Q^2_{ext}$ = 0.683) has more predictive ability than the corresponding QuBiLS-MIDAS model ($Q^2_{ext}$ = 0.6405), but the latter is superior to the remaining methodologies. Nonetheless, it is important to highlight that the procedures O3Q and O3A/O3Q are alignment dependent and thus their use is generally restricted to congeneric datasets [45]. In the specific case of the 2D-FPT methodology for the ACHE and DHFR datasets, the achieved results are based on non-linear models, while the proposed outcomes are determined with linear models.

The obtained results provide evidence that the QuBiLS-MIDAS MDs properly codify structural information of the molecules considering interactions among N (N = 2, 3, 4) atoms and thus are suitable for developing QSAR models that contribute to the prediction of biological activity in novel structures. However, notwithstanding the comparable-to-superior predictions achieved by the proposed models, it is important to statistically validate these results.
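The cross-validation tally discussed above can be re-derived directly from Table 7. The following is a minimal sketch, not part of the original study: the names `TABLE7` and `datasets_won` are illustrative, and the values are simply transcribed from the table (QuBiLS-MIDAS row "a").

```python
# Leave-one-out cross-validation coefficients transcribed from Table 7.
DATASETS = ["ACE", "ACHE", "BZR", "COX2", "DHFR", "GPB", "THER", "THR"]

TABLE7 = {
    "QuBiLS-MIDAS": [0.7756, 0.6574, 0.6931, 0.6313, 0.7055, 0.8124, 0.7530, 0.8149],
    "CoMFA": [0.68, 0.52, 0.32, 0.49, 0.65, 0.42, 0.52, 0.59],
    "COMSIA basic": [0.65, 0.48, 0.41, 0.43, 0.63, 0.43, 0.54, 0.62],
    "COMSIA extra": [0.66, 0.49, 0.45, 0.57, 0.65, 0.61, 0.51, 0.72],
    "EVA": [0.70, 0.42, 0.40, 0.45, 0.64, 0.58, 0.48, 0.47],
    "HQSAR": [0.72, 0.34, 0.42, 0.50, 0.69, 0.66, 0.49, 0.50],
    "2D": [0.68, 0.32, 0.36, 0.49, 0.51, 0.31, 0.62, 0.62],
    "2.5D": [0.72, 0.31, 0.35, 0.55, 0.53, 0.46, 0.66, 0.52],
    "SAMFA-RF": [0.69, 0.58, 0.43, 0.38, 0.70, 0.66, 0.52, 0.53],
    "SAMFA-SVM": [0.52, 0.29, 0.38, 0.39, 0.57, 0.53, 0.18, 0.39],
    "SAMFA-PLS": [0.65, 0.54, 0.49, 0.40, 0.68, 0.61, 0.60, 0.56],
    "Fingerprints Library": [0.69, 0.57, 0.56, 0.55, 0.76, 0.53, 0.53, 0.58],
    "O3Q": [0.69, 0.52, 0.42, 0.48, 0.70, 0.55, 0.48, 0.59],
    "O3QMFA": [0.65, 0.41, 0.41, 0.43, 0.69, 0.30, 0.47, 0.65],
    "O3A/O3Q": [0.71, 0.55, 0.46, 0.46, 0.66, 0.50, 0.67, 0.68],
    "COSMOsar3D": [0.71, 0.53, 0.45, 0.54, 0.69, 0.61, 0.58, 0.74],
}


def datasets_won(table, ours="QuBiLS-MIDAS"):
    """Return the datasets where `ours` strictly exceeds every other procedure."""
    won = []
    for col, name in enumerate(DATASETS):
        best_other = max(vals[col] for proc, vals in table.items() if proc != ours)
        if table[ours][col] > best_other:
            won.append(name)
    return won


won = datasets_won(TABLE7)
print(f"{len(won)} of {len(DATASETS)} datasets: {won}")
```

The same function can be applied to the rows of Table 8 to tally the external predictions.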

Statistical analysis of the external predictive accuracy

To perform this analysis, the external prediction values ($Q^2_{ext}$) obtained by the QuBiLS-MIDAS models were taken into consideration, as well as the ones reported in the literature for the external compounds of each dataset (see Table 8). Firstly, a descriptive analysis through boxplot graphics was performed (with SPSS software) and the obtained results are represented in Fig. 1. As can be observed, the QuBiLS-MIDAS and 2D-FPT models tend to behave similarly and to be superior to the remaining procedures. Also, it can be noted that the highest prediction among all procedures analyzed is achieved by the QuBiLS-MIDAS models. In addition, comparing the boxplots corresponding to the QuBiLS-MIDAS and 2D-FPT approaches, it can be concluded that the predictions obtained by the former are less scattered than those attained by the latter and thus the QuBiLS-MIDAS models have a more suitable external predictive ability irrespective of the chemical dataset analyzed. However, these results are not enough to state that the models based on QuBiLS-MIDAS MDs are statistically the best.
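The scatter comparison between the two leading approaches can be checked numerically from their rows in Table 8. This is only an illustrative sketch: the helper `spread` is a hypothetical name, and the quartiles come from Python's `statistics.quantiles` (default "exclusive" method), which need not coincide with the hinges SPSS draws in a boxplot.

```python
import statistics

# External predictions transcribed from Table 8 (ACE ... THR).
QUBILS = [0.7422, 0.6309, 0.5692, 0.4932, 0.6405, 0.8283, 0.7248, 0.7674]
FPT_2D = [0.713, 0.714, 0.378, 0.329, 0.683, 0.667, 0.649, 0.737]


def spread(values):
    """Summarise dispersion: range, interquartile range and sample std. dev."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "stdev": statistics.stdev(values),
    }


qubils_spread = spread(QUBILS)
fpt_spread = spread(FPT_2D)
for label, summary in (("QuBiLS-MIDAS", qubils_spread), ("2D-FPT", fpt_spread)):
    print(label, {key: round(val, 4) for key, val in summary.items()})
```

All three dispersion measures are smaller for the QuBiLS-MIDAS row, in line with the narrower box described in the text.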

Fig. 1

Boxplot graphic for the external predictive accuracy achieved by each QSAR methodology considered in this manuscript

Therefore, an exploratory study was performed to analyze the normality of the data by using the Kolmogorov–Smirnov (K–S) test corrected by Lilliefors [53] and the Shapiro–Wilk test [54]. This was done to check whether the variable departs from normality for at least one model, in which case non-parametric tests are the proper choice. As can be observed in Additional file 1: Table S5, the null hypothesis of normality can only be rejected with high certainty for the $Q^2_{ext}$ values of the 2D-FPT and COSMOsar3D models, although with the Shapiro–Wilk test the null hypothesis is rejected for COMSIA basic as well. Therefore, non-parametric tests may be considered suitable for this statistical analysis. Subsequently, a Friedman test [55] for multiple comparisons was performed taking into consideration the results of all QSAR procedures. As can be seen in Additional file 1: Table S6A, there are global differences among the considered methods, with the QuBiLS-MIDAS models being those with the best performance, followed by the 2D-FPT, O3Q and COSMOsar3D approaches, respectively, with a Kendall's W [56] concordance level of 0.607 (see Additional file 1: Table S6B). In order to determine the specific statistical differences, a Wilcoxon signed-rank test [57] was carried out (with R software) by using the Benjamini and Hochberg [58] (BH) adjustment method (one-tailed p value calculation) for controlling the false discovery rate (FDR). The results of this analysis are shown in Table 9, where a significant p value (p < 0.05) means that the row approach is superior to the corresponding column. It can be noted that the QuBiLS-MIDAS models yield statistically better predictions than the other methodologies considered, with the exception of the 2D-FPT approach.

Table 9

Wilcoxon signed-rank test for pairwise multiple hypothesis tests by using BH as adjustment method for controlling FDR. It shows the one-tailed p-values for the greater alternative

	2D	2.5D	EVA	COMSIA basic	HQSAR	O3QMFA	CoMFA	O3A/O3Q	COMSIA extra	COSMOsar3D	O3Q	2D-FPT
2.5D	0.115	–	–	–	–	–	–	–	–	–	–	–
EVA	0.138	0.402	–	–	–	–	–	–	–	–	–	–
COMSIA basic	0.137	0.115	0.323	–	–	–	–	–	–	–	–	–
HQSAR	0.203	0.380	0.197	0.402	–	–	–	–	–	–	–	–
O3QMFA	0.046	0.046	0.138	0.241	0.312	–	–	–	–	–	–	–
CoMFA	0.051	0.089	0.115	0.241	0.367	0.703	–	–	–	–	–	–
O3A/O3Q	0.089	0.089	0.277	0.556	0.402	0.654	0.727	–	–	–	–	–
COMSIA extra	0.031	0.051	0.045	0.051	0.164	0.427	0.249	0.272	–	–	–	–
COSMOsar3D	0.027	0.022	0.036	0.022	0.051	0.054	0.027	0.068	0.015	–	–	–
O3Q	0.015	0.022	0.022	0.015	0.186	0.051	0.042	0.051	0.203	0.698	–	–
2D-FPT	0.015	0.015	0.015	0.015	0.015	0.022	0.015	0.015	0.022	0.068	0.015	–
QuBiLS-MIDAS	0.015	0.015	0.015	0.015	0.015	0.015	0.015	0.022	0.015	0.015	0.022	0.138

Italic values indicate statistically significant differences of the QuBiLS-MIDAS models with respect to the other QSAR methodologies
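The non-parametric procedure can be sketched with the standard library alone, using the values of Table 8. This is a simplified illustration under stated assumptions, not the SPSS/R analysis of the study: the Friedman statistic and Kendall's W are computed without tie correction, the Wilcoxon p values are exact one-tailed values obtained by enumerating sign patterns, and the BH adjustment is applied only to the 12 comparisons of QuBiLS-MIDAS against each other procedure (Table 9 adjusts over all pairwise comparisons, so the adjusted values differ). All function names are hypothetical.

```python
from itertools import product

# External Q2 values transcribed from Table 8 (columns: ACE, ACHE, BZR,
# COX2, DHFR, GPB, THER, THR); QuBiLS-MIDAS is the "a" row.
TABLE8 = {
    "QuBiLS-MIDAS": [0.7422, 0.6309, 0.5692, 0.4932, 0.6405, 0.8283, 0.7248, 0.7674],
    "CoMFA": [0.49, 0.47, 0.00, 0.29, 0.59, 0.42, 0.54, 0.63],
    "COMSIA basic": [0.52, 0.44, 0.08, 0.03, 0.52, 0.46, 0.36, 0.55],
    "COMSIA extra": [0.49, 0.44, 0.12, 0.37, 0.53, 0.59, 0.53, 0.63],
    "EVA": [0.36, 0.28, 0.16, 0.17, 0.57, 0.49, 0.36, 0.11],
    "HQSAR": [0.30, 0.37, 0.17, 0.27, 0.63, 0.58, 0.53, -0.25],
    "2D": [0.47, 0.16, 0.14, 0.25, 0.47, -0.06, 0.14, 0.04],
    "2.5D": [0.51, 0.16, 0.20, 0.27, 0.49, 0.04, 0.07, 0.28],
    "O3Q": [0.69, 0.67, 0.17, 0.32, 0.60, 0.50, 0.51, 0.67],
    "O3QMFA": [0.45, 0.61, 0.13, 0.37, 0.59, 0.29, 0.49, 0.60],
    "O3A/O3Q": [0.54, 0.65, 0.24, 0.28, 0.53, 0.41, -0.18, 0.30],
    "COSMOsar3D": [0.62, 0.61, 0.13, 0.43, 0.58, 0.63, 0.59, 0.66],
    "2D-FPT": [0.713, 0.714, 0.378, 0.329, 0.683, 0.667, 0.649, 0.737],
}


def average_ranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman(table):
    """Return (mean ranks, chi-square, Kendall's W), without tie correction."""
    names = list(table)
    n, k = len(table[names[0]]), len(names)
    rank_sums = dict.fromkeys(names, 0.0)
    for col in range(n):
        # Negate so that rank 1 goes to the highest value in each dataset.
        ranks = average_ranks([-table[name][col] for name in names])
        for name, rank in zip(names, ranks):
            rank_sums[name] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums.values())
    chi2 -= 3.0 * n * (k + 1)
    return {name: s / n for name, s in rank_sums.items()}, chi2, chi2 / (n * (k - 1))


def wilcoxon_greater_exact(x, y):
    """Exact one-tailed signed-rank p value for H1: x tends to exceed y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(diffs))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus - 1e-9
    )
    return hits / 2 ** len(diffs)


def benjamini_hochberg(pvals):
    """BH-adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for pos, i in enumerate(order):
        running_min = min(running_min, pvals[i] * m / (m - pos))
        adjusted[i] = running_min
    return adjusted


mean_ranks, chi2, kendall_w = friedman(TABLE8)
ours = "QuBiLS-MIDAS"
others = [name for name in TABLE8 if name != ours]
raw = [wilcoxon_greater_exact(TABLE8[ours], TABLE8[name]) for name in others]
adjusted = dict(zip(others, benjamini_hochberg(raw)))
not_significant = [name for name, p in adjusted.items() if p >= 0.05]
print("mean rank of", ours, "=", mean_ranks[ours], "| not significant:", not_significant)
```

Under these assumptions, QuBiLS-MIDAS obtains the best mean rank and the only comparison that fails to reach significance is the one against 2D-FPT, which agrees with the pattern of the last row of Table 9.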


Analysis of the predictive ability according to conformer generation methods

Conformer generation constitutes an important step when chemoinformatics tasks are performed, particularly in computer-aided drug design, where the outcomes of a virtual screening process may depend on the 3D structures employed to build the procedure to be used, e.g. a QSAR model [59]. Therefore, in this section the sensitivity of the QuBiLS-MIDAS MDs to different conformer generation methods is evaluated in order to understand how these could affect the performance of the indices. To this end, the software FROG2 [60], RDKit [61], BALLOON [62], OpenBabel [63] and ChemAxon Standardizer [64] were employed to generate the 3D structures, taking as starting point the SMILES representations corresponding to the eight compound datasets considered in this report. Firstly, a study was performed to determine whether the models developed using the training structures generated with CORINA (see Table 6) are applicable to the test structures generated with the previously mentioned programs. The external predictive abilities obtained in this study are graphically represented in Fig. 2. These results are significantly inferior to those achieved with the test sets based on CORINA (see Additional file 1: Table S8), with the exception of RDKIT. This indicates that QSAR models based on QuBiLS-MIDAS MDs are not suitable for predicting the biological activity of compounds whose structures were generated with a procedure different from the one used for the training structures. Thus, it can be stated that the performance of the QuBiLS-MIDAS MDs depends on the 3D conformations from which they are computed.

Fig. 2

Boxplot graphic for the external predictive accuracy achieved by the QSAR models reported in this manuscript (see Table 6) and fitted using structures generated by CORINA software, over the corresponding test sets optimized by five different toolkits

It is important to highlight that the previous results do not mean that CORINA software is the most suitable to generate the 3D structures to be used in the development of the QSAR models based on QuBiLS-MIDAS MDs. In order to support this assertion, the following simple workflow was carried out considering the conformations generated by each previously mentioned program (including CORINA) for each chemical dataset. First, 8640 two-linear algebraic indices (Additional file 1: Table S9) were computed. Then, the CfsSubsetEval feature selection procedure, implemented in WEKA software, was applied in order to retain those MDs highly correlated with the dependent variable and with low intercorrelation among them. Next, the MLR-GA procedure implemented in MobyDigs software was employed to build 9-variable models, performing 100,000 iterations and considering the tabu list options of removing MDs with correlation equal to or greater than 0.95, fourth order moment greater than 8 and standardized entropy less than 0.3. The fitness function used was the statistical parameter $Q^2_{loo}$. Finally, the model with the highest $Q^2_{loo}$ value was selected as the best model, and its external predictive ability was determined. Table 10 shows the external predictive power of the models developed from different 3D conformations, as well as the average of the rankings corresponding to the conformer generation methods considered in this study. As can be observed, the best predictions are achieved by the models built from 3D molecular structures generated by the FROG2 procedure, followed by the results obtained from the methods CORINA, CHEMAXON, RDKIT, OPENBABEL and BALLOON, respectively. However, a Friedman test (Additional file 1: Table S10) shows that there are no global statistical differences among these predictions, which indicates, at least for this preliminary study, that QSAR models with good predictive accuracy can be developed with QuBiLS-MIDAS MDs irrespective of the procedure used to obtain optimized structures.
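The model selection step of the workflow relies on a cross-validation statistic. As a minimal sketch, assuming the usual leave-one-out definition $Q^2_{loo} = 1 - \mathrm{PRESS}/\mathrm{TSS}$ and a single descriptor instead of the 9-variable MLR models of the study, the computation can be illustrated as follows. The functions `fit_line` and `q2_loo` and the toy data are hypothetical and are not taken from the datasets analyzed here.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def q2_loo(x, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    mean_y = sum(y) / len(y)
    tss = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / tss


# Made-up descriptor/activity values, for illustration only.
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [1.0 + 2.0 * v for v in x_toy]      # perfectly linear response
y_noisy = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]     # same trend with small noise

q2_exact = q2_loo(x_toy, y_exact)
q2_noisy = q2_loo(x_toy, y_noisy)
print(round(q2_exact, 6), round(q2_noisy, 4))
```

A genetic algorithm such as the one in MobyDigs would evaluate a statistic of this kind for every candidate descriptor subset and keep the subsets with the highest values.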

Table 10

External predictive accuracy achieved by QSAR models developed from 3D molecular structures generated with six different programs

	ACE	ACHE	BZR	COX2	DHFR	GPB	THER	THR	Rank average
BALLOON	0.3296	0.1943	0.3949	0.2451	0.3758	0.0000	0.0000	0.0000	4.5
CHEMAXON	0.5504	0.1343	0.4163	0.3361	0.2978	0.1687	0.0000	0.1386	3.375
CORINA	0.4133	0.0556	0.3628	0.2865	0.4288	0.2767	0.1915	0.2334	3.25
FROG2	0.4832	0.3535	0.3635	0.3393	0.3786	0.2712	0.3264	0.1457	2.125
OPENBABEL	0.3993	0.1306	0.1715	0.2775	0.3460	0.4742	0.2806	0.0803	4
RDKIT	0.4181	0.1770	0.3024	0.2189	0.5008	0.4511	0.0000	0.0710	3.75
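The "Rank average" column of Table 10 can be reproduced from the tabulated values by ranking the six programs within each dataset (rank 1 for the highest value, tied values sharing their average rank) and averaging over the eight datasets. The sketch below uses illustrative names (`TABLE10`, `rank_averages`) and only the data printed in the table.

```python
# External predictive accuracy transcribed from Table 10 (ACE ... THR).
TABLE10 = {
    "BALLOON": [0.3296, 0.1943, 0.3949, 0.2451, 0.3758, 0.0000, 0.0000, 0.0000],
    "CHEMAXON": [0.5504, 0.1343, 0.4163, 0.3361, 0.2978, 0.1687, 0.0000, 0.1386],
    "CORINA": [0.4133, 0.0556, 0.3628, 0.2865, 0.4288, 0.2767, 0.1915, 0.2334],
    "FROG2": [0.4832, 0.3535, 0.3635, 0.3393, 0.3786, 0.2712, 0.3264, 0.1457],
    "OPENBABEL": [0.3993, 0.1306, 0.1715, 0.2775, 0.3460, 0.4742, 0.2806, 0.0803],
    "RDKIT": [0.4181, 0.1770, 0.3024, 0.2189, 0.5008, 0.4511, 0.0000, 0.0710],
}


def descending_ranks(values):
    """Rank 1 for the largest value; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_averages(table):
    """Average, over all datasets, of each program's within-dataset rank."""
    names = list(table)
    n_datasets = len(table[names[0]])
    totals = dict.fromkeys(names, 0.0)
    for col in range(n_datasets):
        ranks = descending_ranks([table[name][col] for name in names])
        for name, rank in zip(names, ranks):
            totals[name] += rank
    return {name: total / n_datasets for name, total in totals.items()}


averages = rank_averages(TABLE10)
for name, value in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")
```

Sorting by the resulting averages gives the order FROG2, CORINA, CHEMAXON, RDKIT, OPENBABEL and BALLOON quoted in the text.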

Note that in the forthcoming version of the QuBiLS-MIDAS software, the RDKIT program will be incorporated as a built-in option for conformer generation. This is because the FROG2 procedure can only be accessed through a web browser, while the CORINA and CHEMAXON software are not freely available. In addition, according to a study performed in Ref. [65] to assess the quality of the conformations generated by several free methods, RDKIT tends to generate the conformations most similar to the experimental structures, in addition to being the second fastest among all toolkits analyzed.

Conclusions

In this report the predictive accuracy of the novel alignment-free geometric molecular descriptors based on N-linear algebraic maps (so-called QuBiLS-MIDAS) has been examined. To this end, QSAR models for predicting the biological activity in eight molecular datasets were developed by using MLR as the statistical technique. The results obtained with the QuBiLS-MIDAS models were compared with several QSAR procedures reported in the literature according to the correlation coefficients achieved with the leave-one-out cross-validation and external prediction methods, and generally superior performances were observed with the QuBiLS-MIDAS framework. A few exceptions were observed: for the $Q^2_{loo}$ parameter, the QuBiLS-MIDAS approach is outperformed only by the ASP-based (fingerprint) method in the DHFR dataset, while for the $Q^2_{ext}$ parameter, the QuBiLS-MIDAS method yields inferior results with respect to the 2D-FPT methodology in the DHFR and ACHE test sets. Also, inferior $Q^2_{ext}$ values are yielded by the QuBiLS-MIDAS approach with respect to the O3Q and O3A/O3Q procedures in the ACHE test set. However, these methodologies are based on techniques more complex than MLR and/or cannot be used in non-congeneric datasets because they are alignment-dependent. Thus, considering the maximum parsimony principle (Ockham's razor), the QuBiLS-MIDAS approach seems to be more suitable than the other QSAR methods. Additionally, several steps for statistically validating the obtained results are detailed. In this sense, the external predictive ability of the developed models was compared with that of other methodologies by means of multiple comparison tests. It was demonstrated that the QuBiLS-MIDAS models yield the best predictions, and that these are significantly superior to those of 11 of the 12 methodologies compared. Therefore, it can be suggested that the 3D algebraic N-linear molecular descriptors (also known as QuBiLS-MIDAS) are suitable for extracting structural information of the molecules and thus constitute a promising alternative to build models that contribute to the prediction of pharmacokinetic, pharmacodynamic and toxicological properties of novel compounds.

29 in total

1. Variability of molecular descriptors in compound databases revealed by Shannon entropy calculations

Authors:
Journal: J Chem Inf Comput Sci Date: 2000-05

2. Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: 2. Model validation using a benchmark steroid dataset.

Authors: D B Turner; P Willett; A M Ferguson; T W Heritage
Journal: J Comput Aided Mol Des Date: 1999-05 Impact factor: 3.686

3. Prediction of enantiomeric selectivity in chromatography. Application of conformation-dependent and conformation-independent descriptors of molecular chirality.

Authors: João Aires-de-Sousa; Johann Gasteiger
Journal: J Mol Graph Model Date: 2002-03 Impact factor: 2.518

4. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors.

Authors: Viviana Consonni; Roberto Todeschini; Manuela Pavan
Journal: J Chem Inf Comput Sci Date: 2002 May-Jun

5. A comparison of methods for modeling quantitative structure-activity relationships.

Authors: Jeffrey J Sutherland; Lee A O'Brien; Donald F Weaver
Journal: J Med Chem Date: 2004-10-21 Impact factor: 7.446

6. Generating conformer ensembles using a multiobjective genetic algorithm.

Authors: Mikko J Vainio; Mark S Johnson
Journal: J Chem Inf Model Date: 2007-09-25 Impact factor: 4.956

7. SAMFA: simplifying molecular description for 3D-QSAR.

Authors: John Manchester; Ryszard Czermiński
Journal: J Chem Inf Model Date: 2008-05-27 Impact factor: 4.956

8. Comparative spectra analysis (CoSA): spectra as three-dimensional molecular descriptors for the prediction of biological activities.

Authors: R Bursi; T Dao; T van Wijk; M de Gooyer; E Kellenbach; P Verwer
Journal: J Chem Inf Comput Sci Date: 1999 Sep-Oct

9. Fuzzy tricentric pharmacophore fingerprints. 2. Application of topological fuzzy pharmacophore triplets in quantitative structure-activity relationships.

Authors: Fanny Bonachéra; Dragos Horvath
Journal: J Chem Inf Model Date: 2008-02-07 Impact factor: 4.956

10. The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics.

Authors: Christoph Steinbeck; Yongquan Han; Stefan Kuhn; Oliver Horlacher; Edgar Luttmann; Egon Willighagen
Journal: J Chem Inf Comput Sci Date: 2003 Mar-Apr

10 in total
