
Modelling and Recognition of Protein Contact Networks by Multiple Kernel Learning and Dissimilarity Representations.

Alessio Martino1, Enrico De Santis1, Alessandro Giuliani2, Antonello Rizzi1.   

Abstract

Multiple kernel learning is a paradigm which employs a properly constructed chain of kernel functions able to simultaneously analyse different data or different representations of the same data. In this paper, we propose a hybrid classification system based on a linear combination of multiple kernels defined over multiple dissimilarity spaces. The core of the training procedure is the joint optimisation of the kernel weights and of the representative selection in the dissimilarity spaces. This equips the system with a two-fold knowledge discovery phase: by analysing the weights, it is possible to check which representations are more suitable for solving the classification problem, whereas the pivotal patterns selected as representatives can give further insights on the modelled system, possibly with the help of field-experts. The proposed classification system is tested on real proteomic data in order to predict proteins' functional role starting from their folded structure: specifically, a set of eight representations is drawn from the graph-based description of the folded protein. The proposed multiple kernel-based system has also been benchmarked against a clustering-based classification system, also able to exploit multiple dissimilarities simultaneously. Computational results show remarkable classification capabilities and the knowledge discovery analysis is in line with current biological knowledge, suggesting the reliability of the proposed system.


Keywords:  computational biology; dissimilarity spaces; kernel methods; protein contact networks; support vector machines; systems biology

Year:  2020        PMID: 33286565      PMCID: PMC7517365          DOI: 10.3390/e22070794

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Dealing with structured data is an evergreen challenge in pattern recognition and machine learning. Indeed, many real-world systems can effectively be described by structured domains such as networks (e.g., images [1,2]) or sequences (e.g., signatures [3]). Biology is a seminal field in which many complex systems can be described by networks [4], as the biologically relevant information resides in the interaction among the constituting elements: common examples include protein contact networks [5,6], metabolic networks [7] and protein–protein interaction networks [8,9]. Pattern recognition in structured domains poses additional challenges, as many structured domains are non-metric in nature (namely, the pairwise dissimilarities in such domains might not satisfy the four properties of a metric: non-negativity, symmetry, identity, triangle inequality) and patterns may lack any geometrical interpretation [10]. In order to deal with such domains, five mainstream approaches can be pursued [10]: feature generation and/or feature engineering, where numerical features are extracted ad hoc from structured patterns (e.g., using their properties or via measurements) and can be further merged according to different strategies (e.g., in a multi-modal way [11]); ad-hoc dissimilarities in the input space, where custom dissimilarity measures are designed in order to process structured patterns directly in the input domain without moving towards Euclidean (or metric) spaces (common, possibly parametric, edit distances include the Levenshtein distance [12] for sequence domains and graph edit distances [13] for graph domains); embedding via information granulation and granular computing [3,14,15,16,17,18,19,20,21,22,23,24,25]; dissimilarity representations [26,27,28], where structured patterns are embedded in the Euclidean space according to their pairwise dissimilarities; kernel methods, where the mapping between the original input space and the Euclidean space exploits positive-definite kernel functions [29,30,31,32,33]. This paper proposes a novel classification system based on a hybridisation of the latter two strategies: while dissimilarity representations describe (structured) patterns according to their pairwise dissimilarities, kernel methods encode pairwise similarities. Nonetheless, the class of properly-defined kernel functions is restricted: (conditional) positive definiteness may not hold in case of non-metric (dis)similarities. The use of kernel methods in state-of-the-art (non-linear) classifiers such as Support Vector Machines (SVMs) [34,35] is strictly related to their (conditional) positive definiteness due to the quadratic programming optimisation involved: indeed, non-(conditionally) positive definite kernels do not guarantee convergence to the global optimum. Although there is some research about learning from indefinite kernels (see, e.g., [36,37,38,39,40]), evaluating kernels on top of Euclidean spaces (e.g., dissimilarity spaces) retains (conditional) positive definiteness, otherwise resorting to matrix regularisation or other tricks to enforce it. The proposed classification system is able to simultaneously explore multiple dissimilarities following a multiple kernel learning approach, where each kernel considers a different (dissimilarity) representation.
The relative importance of the several kernels involved is automatically determined via genetic optimisation in order to maximise the classifier performance. Further, the very same genetic optimisation is in charge of determining a suitable subset of representative (prototype) patterns in the dissimilarity space [27] in order to shrink the modelling complexity. Hence, the proposed system allows a two-fold a posteriori knowledge discovery phase: by analysing the kernel weights, one can determine the most suitable representation(s) for the problem at hand; the patterns elected as representatives for the dissimilarity space (hence determined as pivotal for tracking the decision boundary amongst the problem-related classes) can give some further insights into the problem at hand. In order to validate the proposed classification system, a bioinformatics-related application is considered, namely protein function prediction. Proteins' 3D structure (both tertiary and quaternary) can effectively be modelled by a network, namely the so-called Protein Contact Network (PCN) [5]. A PCN is a minimalistic (unweighted and undirected) graph-based protein representation where nodes correspond to amino-acids and an edge between two nodes exists if the Euclidean distance between the residues' α-carbon atom coordinates lies within [4, 8] Å. The lower bound is defined in order to discard trivial connections due to closeness along the backbone (first-order neighbour contacts), whereas the upper bound is defined by considering the geometry of peptide bonds (indeed, 8 Å roughly corresponds to two van der Waals radii between residues' α-carbon atoms [41]). It is worth stressing that both node labels (i.e., the type of amino-acid) and edge labels (i.e., the distance between neighbouring residues) are deliberately discarded in order to focus only on the proteins' topological configuration.
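As a concrete illustration, a PCN adjacency matrix can be computed directly from the α-carbon coordinates. The sketch below assumes the commonly used 4 Å lower bound together with the 8 Å upper bound mentioned above; the function name and the toy coordinates are illustrative:

```python
import numpy as np

def build_pcn(coords, lower=4.0, upper=8.0):
    """Build a PCN adjacency matrix from an (n, 3) array of alpha-carbon
    coordinates: residue pairs whose Euclidean distance falls within
    [lower, upper] Angstroms are connected."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adj = ((dists >= lower) & (dists <= upper)).astype(int)
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj

# Toy example: four residues on a line, 5 Angstroms apart
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0], [15.0, 0, 0]])
A = build_pcn(coords)
```

Only consecutive residues (5 Å apart) fall inside the window, so the toy network is a path graph; the resulting matrix is symmetric with a zero diagonal, matching the unweighted, undirected representation described above.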
Despite the minimalistic representation, PCNs have been successfully used in pattern recognition problems for tasks such as solubility/folding propensity prediction [42,43] and physiological role prediction [44,45,46]; furthermore, their structural and dynamical properties have been extensively studied in works such as [47,48,49,50]. In order to investigate how the protein function is related to its topological structure, a subset of the entire Escherichia coli bacterium proteome, corresponding to the E. coli proteins whose 3D structure is known, is considered. The problem itself is cast into a supervised pattern recognition task, where each pattern (protein) is described according to eight different representations drawn from its PCN, and its respective Enzyme Commission (EC) number [51] serves as the ground-truth class label. The EC nomenclature scheme classifies enzymes according to the chemical reaction they catalyse, and a generic entry is composed of four numbers separated by periods. The first digit (1–6) indicates one of the six major enzymatic groups (EC 1: oxidoreductases; EC 2: transferases; EC 3: hydrolases; EC 4: lyases; EC 5: isomerases; EC 6: ligases) and the last three numbers represent a progressively finer functional enzyme classification. In this work, only the first number is considered. However, proteins with no enzymatic characteristics (or proteins whose enzymatic characteristics are still unknown) are not provided with an EC number, thus an additional class of not-enzymes will be considered, identified by the categorical label 7. It is worth noting that the EC classification only loosely relates to the global protein 3D configuration, given that the structure is affected by many determinants other than the catalysed reaction, such as solubility, localisation in the cell, interaction with other proteins and so forth. This makes the classification task intrinsically very difficult.
This paper is organised as follows: Section 2 overviews some theory related to kernel methods and dissimilarity spaces; Section 3 presents the proposed methodology; Section 4 shows the results obtained with the proposed approach, along with a comparison against a clustering-based classifier (also able to explore multiple dissimilarities), and we also provide some remarks on the two-fold knowledge discovery phase. Finally, Section 5 concludes the paper. The paper also features two appendices: Appendix A describes in detail the several representations used for describing PCNs, whereas Appendix B lists the proteins selected as prototypes for the dissimilarity representations.

2. Theoretical Background

Let $\mathcal{S} = \{x_1, \dots, x_N\}$ be the dataset at hand, lying in a given input space $\mathcal{X}$. Moving the problem towards a dissimilarity space [26] consists in expressing each pattern from $\mathcal{S}$ according to its pairwise distances with respect to all other patterns, including itself. In other words, the dataset is cast into the pairwise distance matrix $\mathbf{D} \in \mathbb{R}^{N \times N}$ defined as:

$[\mathbf{D}]_{i,j} = d(x_i, x_j), \quad i, j = 1, \dots, N$ (1)

where $d(\cdot, \cdot)$ is a suitable dissimilarity measure in $\mathcal{X}$, that is $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$. Without loss of generality, hereinafter let us consider $\mathbf{D}$ to be symmetric: if $d$ is at least symmetric, $\mathbf{D}$ is trivially symmetric; in case of asymmetric dissimilarity measures, $\mathbf{D}$ can be 'forced' to be symmetric, e.g., $\mathbf{D} \leftarrow \frac{1}{2}(\mathbf{D} + \mathbf{D}^T)$. The major advantage in moving the problem from a generic input space $\mathcal{X}$ towards $\mathbb{R}^N$ is that the latter can be equipped with algebraic structures such as the inner product or the Minkowski distance, whereas the former might not be metric altogether. As such, in the latter, standard computational intelligence and machine learning techniques can be used without alterations [10]. On the negative side, the explicit evaluation of $\mathbf{D}$ can be computationally expensive, as it leads to a time and space complexity of $O(N^2)$. To this end, in [27], a 'reduced' dissimilarity space representation is proposed, where a subset of prototype patterns $\mathcal{R} \subseteq \mathcal{S}$ is properly chosen and each pattern is described according to its pairwise distances with respect to the prototypes only. This leads to the definition of a 'reduced' pairwise distance matrix $\mathbf{D}^{(\mathcal{R})} \in \mathbb{R}^{N \times |\mathcal{R}|}$ defined as:

$[\mathbf{D}^{(\mathcal{R})}]_{i,j} = d(x_i, r_j), \quad x_i \in \mathcal{S},\ r_j \in \mathcal{R}$ (2)

Since usually $|\mathcal{R}| \ll N$, there is no need to solve a quadratic complexity problem such as evaluating Equation (1). On the negative side, however, the selection of the subset $\mathcal{R}$ is a delicate and challenging task [10] since: the prototypes must well characterise the decision boundary between patterns in the input space; the fewer, the better, as the number of representatives has a major impact on the model complexity (cf. Equation (1) vs. Equation (2)). Several heuristics have been proposed in the literature, ranging from clustering the input space to (possibly class-aware) random selection [10,27,52].
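A minimal sketch of the dissimilarity representations of Equations (1) and (2) follows; the function name and the toy string dissimilarity are illustrative assumptions:

```python
import numpy as np

def dissimilarity_matrix(X, prototypes, d):
    """Reduced dissimilarity representation (Eq. (2)): row i holds the
    dissimilarities of X[i] to each prototype. With prototypes == X this
    degenerates to the full N x N matrix of Eq. (1)."""
    return np.array([[d(x, p) for p in prototypes] for x in X])

# Hypothetical dissimilarity: absolute length difference between sequences
d = lambda a, b: abs(len(a) - len(b))
X = ["AA", "AAAA", "AAAAAA", "A"]
prototypes = ["AA", "AAAAA"]                      # |R| = 2 << N = 4
D_red = dissimilarity_matrix(X, prototypes, d)    # shape (4, 2)
```

Each structured pattern is now a plain real-valued vector of length $|\mathcal{R}|$, on which standard machine learning machinery applies without alterations.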
Kernel methods are usually employed when the input space has an underlying Euclidean geometry. Indeed, the simplest kernel (namely, the linear kernel [30,53]) is the plain inner product between real-valued vectors. The kernel matrix (also known as the Gram matrix) can easily be defined as:

$[\mathbf{K}]_{i,j} = \langle x_i, x_j \rangle$ (3)

Let $K$ be a symmetric and positive semi-definite kernel function from the input space towards $\mathbb{R}$, that is such that

$K(x_i, x_j) = K(x_j, x_i)$ (4)

$\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j K(x_i, x_j) \ge 0, \quad \forall\, c_1, \dots, c_N \in \mathbb{R}$ (5)

As in the linear kernel case, starting from pairwise kernel evaluations, one can easily evaluate the kernel matrix as

$[\mathbf{K}]_{i,j} = K(x_i, x_j)$ (6)

and if $\mathbf{K}$ is a positive semi-definite kernel matrix, then $K$ is a positive semi-definite kernel function. One of the most intriguing properties of kernel methods relies on the so-called kernel trick [29,30]: kernels of the form of Equations (4) and (5) are also known as Mercer kernels, as they satisfy the Mercer condition [32]. Such kernel functions can be seen as the inner product evaluated on a high-dimensional (possibly infinite-dimensional) and usually unknown Hilbert space $\mathcal{H}$. The kernel trick is usually described by the following, seminal, equation:

$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$ (7)

where $\phi: \mathcal{X} \rightarrow \mathcal{H}$ is the implicit (and usually unknown) mapping function. The need for using a non-linear and higher-dimensional mapping is a direct consequence of Cover's theorem [33]. Thanks to the kernel trick, one can use one of the many kernel functions available (e.g., polynomial, Gaussian, radial basis function) in order to perform such a non-linear and higher-dimensional mapping without knowing and explicitly evaluating the mapping function $\phi$. Further, kernel methods can be used in many state-of-the-art classifiers such as (kernelised) SVMs [35,54]. In multiple kernel learning, the kernel matrix is defined as a properly-defined combination of a given number of kernels. The most intuitive combination is a linear combination of the form:

$\mathbf{K} = \sum_{i=1}^{N_R} \beta_i \mathbf{K}_i$ (8)

where the sub-kernels $\mathbf{K}_i$ are single Mercer kernels.
The weights $\beta_i$ can be learned according to different strategies and can be constrained in several ways; see, e.g., [55,56,57,58,59,60,61], or the survey [62]. The rationale behind using multiple kernel learning with respect to plain single kernel learning depends on the application: for example, if data come from different sources, one might want to explore such different sources according to several kernels or, dually, one might want to explore the same data using different kernels, where such kernels may differ in shape and/or type. In this work, a mixture of the two approaches is pursued: same source (PCN), but different representations (see Appendix A). Further, a linear convex combination of radial basis function kernels is employed. The $i$th radial basis function kernel is defined as

$K_i(x_j, x_k) = \exp\left(-\gamma_i \, \lVert x_j - x_k \rVert^2\right)$ (9)

where $\gamma_i$ is its shape parameter. Further, the weights are constrained as

$\beta_i \ge 0, \quad i = 1, \dots, N_R$ (10)

$\sum_{i=1}^{N_R} \beta_i = 1$ (11)

It is rather easy to demonstrate that these selections for both kernels and weights lead to a final kernel matrix (as in Equation (8)) which is still a valid Mercer kernel, and therefore it can be used in kernelised SVMs. Indeed, Cristianini and Shawe-Taylor [31] showed that the summation of two valid kernels is still a valid kernel. Further, Horn and Johnson [63] showed that a positive semi-definite matrix multiplied by a non-negative scalar is still a positive semi-definite matrix. Merging these two results automatically proves that kernels of the form of Equations (8) and (9) with constraints (10) and (11) are valid kernels.
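The closure argument above can also be checked numerically. The sketch below (with illustrative weights and shape parameters) builds two RBF kernels on the same distance matrix, combines them as in Equations (8)-(11), and verifies that the combination is still positive semi-definite:

```python
import numpy as np

def rbf_gram(D, gamma):
    """RBF kernel (Eq. (9)) evaluated on a pre-computed distance matrix."""
    return np.exp(-gamma * D**2)

def multiple_kernel(dist_matrices, betas, gammas):
    """Convex combination of RBF kernels, one per dissimilarity
    representation (Eq. (8)); betas are assumed non-negative and
    summing to one (Eqs. (10)-(11))."""
    return sum(b * rbf_gram(D, g)
               for b, D, g in zip(betas, dist_matrices, gammas))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # one toy representation
K = multiple_kernel([D, D], betas=[0.7, 0.3], gammas=[0.5, 2.0])
eigvals = np.linalg.eigvalsh(K)   # all non-negative: still a Mercer kernel
```

Since each RBF Gram matrix is positive semi-definite and the weights are non-negative, the smallest eigenvalue of the combined matrix stays (numerically) non-negative, and the diagonal remains 1 because the weights sum to one.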

3. Proposed Methodology

Let $\mathcal{S}$ be the dataset at hand, split into three non-overlapping subsets $\mathcal{S}_{TR}$, $\mathcal{S}_{VAL}$ and $\mathcal{S}_{TS}$ (namely training set, validation set and test set). Especially for structured data, several representations (e.g., sets of descriptors) might hold for the same data, therefore let $\{\mathcal{X}^{(1)}, \dots, \mathcal{X}^{(N_R)}\}$ be the set of $N_R$ representations, split in the same fashion (i.e., $\mathcal{S}^{(i)}_{TR}$, $\mathcal{S}^{(i)}_{VAL}$ and $\mathcal{S}^{(i)}_{TS}$). Finally, let $\{d^{(1)}, \dots, d^{(N_R)}\}$ be the set of dissimilarity measures suitable for working in their respective representations. The respective training, validation and test pairwise dissimilarity matrices, as in Equation (1), can be evaluated as follows:

$\mathbf{D}^{(i)}_{TR} = [d^{(i)}(x, y)]_{x, y \in \mathcal{S}_{TR}}, \quad \mathbf{D}^{(i)}_{VAL} = [d^{(i)}(x, y)]_{x \in \mathcal{S}_{VAL},\, y \in \mathcal{S}_{TR}}, \quad \mathbf{D}^{(i)}_{TS} = [d^{(i)}(x, y)]_{x \in \mathcal{S}_{TS},\, y \in \mathcal{S}_{TR}}, \quad i = 1, \dots, N_R$ (12)

Let $\mathbf{w}$ be a binary vector in charge of selecting columns from all matrices in Equation (12): the full pairwise dissimilarity matrices can be sliced to their 'reduced' versions (cf. Equation (1) vs. Equation (2)), hence:

$\hat{\mathbf{D}}^{(i)}_{\star} = \mathbf{D}^{(i)}_{\star}(:, \mathbf{w} = 1), \quad \star \in \{TR, VAL, TS\}$ (13)

where, due to the number of subscripts and superscripts in Equation (13), for ease of notation, we used a MATLAB®-like notation for indexing matrices. In other words, $\mathbf{w}$ acts as a feature (prototype) selector. Given this newly obtained dataset, it is possible to train a kernelised $\nu$-SVM [64] whose multiple kernel has the form of Equation (8), where each sub-kernel has the form of Equation (9), thus:

$[\mathbf{K}]_{j,k} = \sum_{i=1}^{N_R} \beta_i \exp\left(-\gamma_i \, \lVert \hat{\mathbf{D}}^{(i)}(j, :) \ominus \hat{\mathbf{D}}^{(i)}(k, :) \rVert^2\right)$ (14)

where $\ominus$ denotes the pairwise difference. Hence, each dissimilarity representation is subject to a proper non-linear kernel (with its own shape parameter $\gamma_i$). A genetic algorithm [65] acts as a wrapper method in order to automatically tune, in a fully data-driven fashion, the several free parameters introduced in this problem. The choice of a genetic algorithm stems from genetic algorithms being well established in the context of derivative-free optimisation and embarrassingly easy to parallelise, and from the sake of consistency with competing techniques (see Section 4.4). For our problem, the genetic code has the form:

$[\nu \quad \boldsymbol{\beta} \quad \boldsymbol{\gamma} \quad \mathbf{w}]$ (15)

where $\nu$ is the SVM regularisation term, $\boldsymbol{\beta} = [\beta_1, \dots, \beta_{N_R}]$ contains the kernel weights, $\boldsymbol{\gamma} = [\gamma_1, \dots, \gamma_{N_R}]$ contains the kernel shapes and $\mathbf{w}$ properly selects prototypes in the dissimilarity space, as described above.
For the sake of argument, it is worth remarking that there have been several attempts to use evolutionary strategies in order to tune multiple kernel machines: for example, in [66] a genetic algorithm has been used in order to tune the kernel shapes (namely, $\boldsymbol{\gamma}$), whereas in [67] both the kernel shapes and the kernel weights have been tuned by means of an evolution strategy [68]. Conversely, the idea of using a genetic algorithm for prototype selection in the dissimilarity space has been inherited from a previous work [44]. The fitness function to be maximised is the informedness $J$ (also known as Youden's index [69]), defined as:

$J = \text{sensitivity} + \text{specificity} - 1$ (16)

which is, by definition, bounded in the range $[-1, 1]$ (the closer to 1, the better). For the sake of comparison with other performance measures (e.g., accuracy, F-score and the like) which are, by definition, bounded in $[0, 1]$, the fitness function sees a scaled version of the informedness [23,24,25], hence:

$J' = \frac{J + 1}{2}$ (17)

The rationale behind using the informedness rather than other more common performance measures (mainly accuracy and F-score) is that the informedness is well suited for unbalanced classes, without being biased towards the most frequent class (the same is not true for accuracy) and whilst also considering true negative predictions (the same is not true for F-score) [70].
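For reference, Youden's J and its rescaled variant $J'$ of Equation (17) can be computed from the entries of a binary confusion matrix:

```python
def informedness(tp, tn, fp, fn):
    """Youden's J = sensitivity + specificity - 1, bounded in [-1, 1]."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def scaled_informedness(tp, tn, fp, fn):
    """J' = (J + 1) / 2, rescaled to [0, 1] as in Equation (17)."""
    return (informedness(tp, tn, fp, fn) + 1.0) / 2.0

j = informedness(tp=80, tn=90, fp=10, fn=20)   # 0.8 + 0.9 - 1 = 0.7
```

Note that a degenerate classifier which always predicts the positive class gets sensitivity 1 but specificity 0, hence J = 0 regardless of the class imbalance, which is exactly why the informedness is preferred here over accuracy.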
By assuming that the full dissimilarity matrices are pre-evaluated beforehand, the objective function evaluation is performed for each individual from the current generation as follows:

1. The individual receives the full dissimilarity matrices between training data samples, i.e., as in Equation (12);
2. According to the $\mathbf{w}$ portion of its genetic code (see Equation (15)), a subset of prototypes is selected, leading to the 'reduced' dissimilarity matrices between training data, i.e., as in Equation (13);
3. Considering the $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ values in its genetic code, the (multiple) kernel matrix is evaluated by using Equation (14);
4. A $\nu$-SVM is trained using the regularisation term $\nu$ from the genetic code and the kernel matrix from step #3;
5. The individual receives the full dissimilarity matrices between training and validation data, each of which is computed by considering all possible pairs $(x, y)$, where $x$ belongs to the validation set and $y$ belongs to the training set, i.e., as in Equation (12);
6. The 'reduced' dissimilarity matrices are projected thanks to $\mathbf{w}$, i.e., as in Equation (13);
7. The (multiple) kernel matrix between training and validation data is evaluated thanks to $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$, as in Equation (14);
8. The (multiple) kernel matrix from step #7 is fed to the SVM trained in step #4 and the predicted classes on the validation set are returned;
9. The fitness function is evaluated.

At the end of the evolution, the best individual (i.e., the one with the best performance on the validation set) is retained and its final performance is evaluated on the test set. Finally, it is worth remarking on the rationale behind the proposed, structured, genetic code, since a genetic code of the form of Equation (15) allows, in a two-fold manner, a deeper a posteriori knowledge discovery phase.
Indeed, provided that good classification results have been obtained upfront (for the sake of reliability), by looking at $\boldsymbol{\beta}$ it is possible to check which kernels (representations) are considered the most important (higher weights) by the learning machine in order to solve the problem at hand. Similarly, by looking at $\mathbf{w}$, it is possible to check which training set patterns have been selected as representatives and ask why those patterns have been selected instead of others, leading to a pattern-wise check (possibly with the help of field-experts). Especially the latter a posteriori check might be troublesome if a huge number of representatives is selected. In order to alleviate this problem (if present), it is possible to re-state the fitness function (formerly (17)) by considering a convex linear combination between the performance index and the feature selector sparsity, hence:

$f = (1 - \alpha) \cdot (1 - J') + \alpha \cdot \frac{\lVert \mathbf{w} \rVert_1}{|\mathbf{w}|}$ (18)

where $\alpha$ is a user-defined parameter which tunes the convex linear combination by weighting the rightmost term (sparsity) against the leftmost term (performance). It is worth noting that whilst fitness (17) should be maximised, (18) should be minimised.
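One plausible implementation of the sparsity-aware fitness of Equation (18) follows; the exact balancing of the two terms is an assumption here:

```python
def sparsity_aware_fitness(j_scaled, w, alpha=0.5):
    """Minimised fitness: convex combination of the classification error
    (1 - J') and the fraction of prototypes kept by the binary selector w.
    alpha weights sparsity against performance (assumed parameterisation)."""
    sparsity = sum(w) / len(w)
    return (1.0 - alpha) * (1.0 - j_scaled) + alpha * sparsity

# Same performance, fewer prototypes -> lower (better) fitness
f_dense  = sparsity_aware_fitness(0.9, [1, 1, 1, 1])   # keeps all prototypes
f_sparse = sparsity_aware_fitness(0.9, [1, 0, 0, 1])   # keeps half of them
```

With equal performance, the individual selecting fewer prototypes wins, which is precisely the pressure towards compact, interpretable models described above.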

4. Tests and Results

4.1. Data Collection and Pre-Processing

The data retrieval process can be summarised as follows. Using the Python BioServices library [71]: the entire protein list for Escherichia coli str. K12 has been retrieved from UniProt [72]; this list has been cross-checked with the Protein Data Bank [73] in order to discard unresolved proteins (i.e., proteins whose 3D structure is not available). Then, using the BioPython library [74]: .pdb files have been downloaded for all resolved proteins; information such as the EC number and the measurement resolution (if present) has been parsed from the .pdb file header; proteins having multiple EC numbers have been discarded. Finally, using the BioPandas library [75]: α-carbon atom 3D coordinates have been parsed from each .pdb file; in case of multiple equivalent models within the same .pdb file, only the first model is retained; similarly, for atoms having alternate coordinate locations, only the first location is retained. After this retrieval stage, a total number of 6685 proteins has been successfully collected. Some statistics on the measurement resolutions and the number of nodes are sketched in Figure 1a,b, respectively.
Figure 1

Distributions within the original 6685 proteins set.

In order to keep only good quality structures (with reliable atomic coordinates), all proteins with missing resolution in their respective .pdb files and proteins whose resolution is greater than 3Å have been discarded. Further, proteins having more than 1500 nodes have been discarded as well. These filtering procedures dropped the number of available proteins from 6685 to 4957. The class labels (EC number) distribution is summarised in Table 1.
Table 1

Classes distribution within the filtered 4957 proteins set.

Class      | EC1   | EC2   | EC3   | EC4  | EC5  | EC6  | not-enzymes | Total
Count      | 540   | 1017  | 919   | 329  | 182  | 244  | 1726        | 4957
Percentage | 10.89 | 20.52 | 18.54 | 6.64 | 3.67 | 4.92 | 34.82       | 100%
For each of the 4957 available proteins, its respective eight representations (see Appendix A) have been evaluated using the following tools: the NetworkX library [76] (Python) for evaluating the centrality measures and the Vietoris–Rips complex; the NumPy and SciPy libraries [77,78] (Python) for several algebraic computations, mainly the spectral decompositions for the energy, Laplacian energy, heat trace and heat content invariants, and for the homology group rank; the rnetcarto (https://cran.r-project.org/package=rnetcarto) library (R) for network cartography. As in previous works [45,46], the 7-class classification problem is cast into seven binary classification problems in one-against-all fashion, hence the $i$th classifier sees the $i$th class as positive and all other classes as negative. The eight representations are split into training, validation and test sets in a stratified manner in order to preserve the labels' distribution across splits. Thus, each of the seven classifiers sees a different training-validation-test split due to the one-against-all label recoding. The genetic optimisation and classification stage has been performed in MATLAB® R2018a using the built-in genetic algorithm and LibSVM [79] for $\nu$-SVMs.
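The one-against-all label recoding described above can be sketched as follows (the class values are the EC labels of Table 1; the function name is illustrative):

```python
import numpy as np

def one_vs_all_labels(y, positive_class):
    """Recode multi-class labels for the one-against-all scheme:
    the chosen class becomes +1, every other class becomes -1."""
    return np.where(np.asarray(y) == positive_class, 1, -1)

y = [1, 2, 7, 3, 2, 7]                            # EC classes; 7 = not-enzymes
y_ec2 = one_vs_all_labels(y, positive_class=2)    # labels for the EC2 classifier
```

Each of the seven binary classifiers receives its own recoded label vector, which is why the stratified splits differ from classifier to classifier.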

4.2. Computational Results with Fitness Function (17)

The first test suite sees Equation (17) as the fitness function, hence the system aims at the maximisation of the (normalised) informedness. The genetic algorithm has been configured to host 100 individuals for a maximum of 100 generations and each individual's genetic code (upper/lower bounds and constraints, if any) is summarised in Table 2. At each generation, the elitism is set to the top 10% of individuals; the crossover operates in a scattered fashion; the selection operator follows the roulette wheel heuristic; and the mutation adds to each real-valued gene ($\nu$, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$) a random number extracted from a zero-mean Gaussian distribution whose variance shrinks as generations go by, whereas it acts in a flip-the-bit fashion for Boolean-valued genes ($\mathbf{w}$).
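The mutation operator just described can be sketched as follows; the linear shrinking schedule and the flip probability are illustrative assumptions, not the MATLAB defaults:

```python
import numpy as np

def mutate(real, boolean, generation, max_generations,
           sigma0=0.1, flip_p=0.05, rng=None):
    """Gaussian mutation with variance shrinking across generations for the
    real-valued genes (nu, beta, gamma) and flip-the-bit mutation for the
    Boolean genes (w)."""
    rng = rng or np.random.default_rng()
    sigma = sigma0 * (1.0 - generation / max_generations)  # shrinks to 0
    real = real + rng.normal(0.0, sigma, size=real.shape)
    flips = rng.random(boolean.shape) < flip_p
    return real, np.where(flips, 1 - boolean, boolean)

rng = np.random.default_rng(0)
real = np.array([0.5, 0.2, 0.3])
boolean = np.array([1, 0, 1, 1, 0])
r1, b1 = mutate(real, boolean, generation=10, max_generations=100, rng=rng)
r2, b2 = mutate(real, boolean, generation=100, max_generations=100, rng=rng)
```

At the last generation the Gaussian perturbation vanishes entirely, so the real-valued part of the genetic code is left untouched while the Boolean selector can still flip.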
Table 2

Genetic algorithm parameters description.

Parameter | Bounds                          | Constraints
ν         | ν ∈ (0, 1]                      | by definition
β         | β_i ∈ [0, 1], i = 1, …, N_R     | Σ_{i=1}^{N_R} β_i = 1
γ         | γ_i ∈ (0, 100], i = 1, …, N_R   | none
w         | w_i ∈ {0, 1}, i = 1, …, |D_TR|  | none
Table 3 shows the performance obtained by the proposed Multiple Kernels over Multiple Dissimilarities (MKMD, for short) approach using the fitness function (17). Due to the randomness in genetic optimisation, five runs have been performed for each classifier and the average results are shown. Figures of merit include the accuracy $= (TP + TN)/(TP + TN + FP + FN)$, the precision $= TP/(TP + FP)$ and the recall $= TP/(TP + FN)$, where TP, TN, FP and FN indicate true positives, true negatives, false positives and false negatives, respectively.
Table 3

Test set performances with fitness function (17).

Class           | Accuracy | Precision | Recall | Informedness J′ | AUC  | Sparsity
1 (EC1)         | 0.95     | 0.87      | 0.68   | 0.83            | 0.92 | 49.43
2 (EC2)         | 0.91     | 0.88      | 0.66   | 0.82            | 0.90 | 49.62
3 (EC3)         | 0.90     | 0.84      | 0.58   | 0.78            | 0.88 | 49.48
4 (EC4)         | 0.97     | 0.90      | 0.56   | 0.78            | 0.88 | 49.42
5 (EC5)         | 0.98     | 0.83      | 0.44   | 0.72            | 0.78 | 50.78
6 (EC6)         | 0.99     | 0.94      | 0.76   | 0.88            | 0.95 | 49.28
7 (not-enzymes) | 0.82     | 0.77      | 0.70   | 0.79            | 0.89 | 50.52

Informedness values are normalised as in Equation (17).

Further figures of merit include the normalised informedness $J'$, as in Equation (17), and the Area Under the Curve (AUC), namely the area under the Receiver Operating Characteristic (ROC) curve [80]. Similarly, Figure 2 shows the ROC curves for all classifiers by considering their respective run with greatest AUC.
Figure 2

ROC curves with fitness function (17). In brackets, the respective AUC values.

4.3. Computational Results with Fitness Function (18)

These experiments see the fitness function (18) in lieu of (17), where the weighting parameter is set to $\alpha = 0.5$ in order to give the same importance to performance and sparsity. In order to ensure a fair comparison with the previous analysis, the same training-validation-test splits have been used for all seven classifiers, along with the same genetic algorithm setup (genetic code, number of individuals and generations, genetic operators). Table 4 shows the average performance obtained by the seven classifiers across five genetic algorithm runs. As in the previous case, Figure 3 shows the ROC curves for all classifiers by considering their respective run with greatest AUC.
Table 4

Test set performances with fitness function (18) and α = 0.5.

Class           | Accuracy | Precision | Recall | Informedness J′ | AUC  | Sparsity
1 (EC1)         | 0.95     | 0.86      | 0.69   | 0.84            | 0.92 | 33.08
2 (EC2)         | 0.91     | 0.88      | 0.67   | 0.82            | 0.90 | 32.48
3 (EC3)         | 0.90     | 0.83      | 0.57   | 0.77            | 0.87 | 29.94
4 (EC4)         | 0.97     | 0.88      | 0.54   | 0.77            | 0.88 | 33.89
5 (EC5)         | 0.98     | 0.85      | 0.45   | 0.73            | 0.79 | 35.54
6 (EC6)         | 0.98     | 0.91      | 0.76   | 0.88            | 0.95 | 35.38
7 (not-enzymes) | 0.82     | 0.77      | 0.69   | 0.79            | 0.88 | 33.37

Informedness values are normalised as in Equation (17).

Figure 3

ROC curves with fitness function (18) and α = 0.5. In brackets, the respective AUC values.

4.4. Benchmarking against a Clustering-Based One-Class Classifier

In order to properly benchmark the proposed MKMD system, a One-Class Classification System (hereinafter OCC or OCC_System) capable of exploiting multiple dissimilarities is used. This classification system has been initially proposed in [81] and later used for modelling complex systems such as smart grids [81,82,83] and protein networks [44]. The main idea behind building a model through the One-Class Classifier is to use a clustering-evolutionary hybrid technique [81,82]. The main assumption is that similar protein types have similar chances of generating a specific class, reflecting the cluster model. Therefore, the core of the recognition system is a custom dissimilarity measure computed as a weighted Euclidean distance, that is:

$d_{\mathbf{W}}(x_i, x_j) = \sqrt{(x_i \ominus x_j)^T \, \mathbf{W} \, (x_i \ominus x_j)}$ (19)

where $x_i, x_j$ are two generic patterns and $\mathbf{W}$ is a diagonal matrix whose elements are generated through a suitable vector of weights. The dissimilarity measure is component-wise, therefore the $\ominus$ symbol represents a generic dissimilarity measure, tailored to each pattern subspace, that has to be specified depending on the semantics of the data at hand. In this study, patterns are represented by dissimilarity vectors extracted from each sub-dissimilarity matrix, one for each feature adopted to describe the protein (see Section 2). In other words, patterns pertain to a suitable dissimilarity space. The decision region of each cluster is constructed around the medoid, bounded by the average radius plus a threshold, which are considered, together with the dissimilarity weights, as free parameters. Given a test pattern, the decision rule consists in evaluating whether it falls inside or outside the overall target decision region, by checking whether it falls inside the closest cluster.
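As a sketch, Equation (19) with a plain component-wise difference for $\ominus$ reduces to the following (the weight values are illustrative):

```python
import numpy as np

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance of Eq. (19): component-wise differences
    scaled by a diagonal weight matrix built from `weights`."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ np.diag(weights) @ diff))

x, y = [1.0, 2.0, 3.0], [2.0, 0.0, 3.0]
d = weighted_euclidean(x, y, weights=[1.0, 0.25, 0.0])
```

A zero weight switches a component off entirely, which is how the genetic optimisation can emphasise or suppress individual dissimilarity representations.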
The learning procedure consists in clustering the training set $\mathcal{S}_{TR}$, composed of target patterns, adopting a standard genetic algorithm in charge of evolving a family of cluster-based classifiers, considering the weights and the thresholds of the decision regions as the search space, guided by a proper objective function. The latter is evaluated on the validation set $\mathcal{S}_{VAL}$, taking into account a linear combination of the classification accuracy (which we seek to maximise) and the extension of the thresholds (which should be minimised). Note that, in building the classification model, we use only target patterns, while non-target ones are used in the cross-validation phase, hence the adopted learning paradigm is the One-Class classification one [84,85]. Moreover, in order to overcome the well-known initialisation limitations of the standard k-means algorithm, the OCC_System initialises more than one instance of the clustering algorithm with random starting representatives, namely medoids, since the OCC_System is capable of dealing with arbitrarily structured data [86,87,88]. At the test stage (or during validation), a voting procedure over the cluster models is performed. This technique allows building a more robust protein model. Figure 4 shows the schematic representing the core subsystems of the proposed OCC_System, such as the ones performing the clustering procedure and the genetic algorithm. Moreover, the Test subsystem is shown, where, given a generic test pattern and a learned model, it is possible to associate a score value (soft decision) besides the Boolean decision. Hence, we equip each cluster with a suitable membership function, denoted in the following as $\mu_i(\cdot)$. In practice, we generate a fuzzy set [89] over each cluster. The membership function allows quantifying the uncertainty (expressed by the membership degree in $[0, 1]$) of a decision about the recognition of a test pattern. Membership values close to either 0 or 1 denote 'certain' and hence reliable decisions.
When the membership degree assigned to a test pattern is close to 0.5, there is no clear indication of whether such a test pattern is really a target pattern or not (regardless of the correctness of the Boolean decision).
Figure 4

Schematic of the classification system able to learn a classification model for each positive class. The model provides the crisp decision as well as a score (a real number) encoding the decision reliability.

For this purpose, we adopt a parametric sigmoid model for the membership function, defined as mu_i(d) = 1 / (1 + exp((d - b_i)/a_i)), where a_i and b_i are two parameters specific to the ith cluster and d is the dissimilarity measure (19). Notably, a_i is used to control the steepness of the sigmoid (the lower the value, the faster the rate of change), and b_i is used to translate the function in the input domain. If a cluster (that models a typical protein found in the training set) is very compact, then it describes a very specific scenario; therefore, no significant variations should be accepted when considering test patterns as members of this cluster. Similarly, if a cluster is characterised by a wide extent, then we might be more tolerant in the evaluation of the membership. Accordingly, the parameter a_i is set according to the extent of the ith cluster. On the other hand, b_i is defined in terms of the threshold delimiting the decision region related to the ith cluster; this allows us to position the part of the sigmoid that changes faster right in between the area of the decision region determined by the dissimilarity values falling around that boundary. Finally, the soft decision function is defined as the membership value returned by the cluster where the test (target) pattern falls. To summarise, the OCC_System works in two phases:

- learning a cluster model of proteins through a suitable dataset divided into two disjoint sets, namely the training and validation sets;
- using the learned model to recognise or classify unseen proteins drawn from the test set, assigning to each pattern a probability value.

The OCC parameters defining the model are optimised by means of a genetic algorithm guided by a suitable objective function that takes into account the classification accuracy. For the sake of comparison, the same genetic operators (selection, mutation, crossover, elitism) as per the MKMD system have been considered (see Section 4.2). As concerns the complexity of the model, measured as the cardinality of the partition k, we choose a suitable value (k = 120 for all classes, as reported below).
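The sigmoid membership function can be sketched as follows; the parameter values are illustrative, since in the system a_i depends on the cluster extent and b_i on the decision-region boundary.

```python
import math

def membership(d, a, b):
    """Sigmoid membership: close to 1 for small dissimilarities d and
    close to 0 for large ones. a controls the steepness (the lower the
    value, the faster the rate of change), b translates the curve so
    that the fastest change happens at the decision-region boundary."""
    return 1.0 / (1.0 + math.exp((d - b) / a))

# A compact cluster (small a) switches sharply from "member" to "non-member":
print(round(membership(0.5, a=0.1, b=1.0), 3))  # well inside -> 0.993
print(round(membership(1.5, a=0.1, b=1.0), 3))  # well outside -> 0.007
# Membership near 0.5 flags an unreliable decision:
print(round(membership(1.0, a=0.1, b=1.0), 3))  # on the boundary -> 0.5
```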
Table 5 shows the comparison between the OCC_System and the MKMD approach. In order to ensure a fair comparison, since the OCC_System does not perform representative selection in the dissimilarity space, the binary selection vector has been removed from the MKMD genetic code (cf. Equation (15)), which is equivalent to considering all of its entries unitary (i.e., no representative selection). Similarly, Figure 5a and Figure 5b show the ROC curves for MKMD and OCC, respectively.
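The dissimilarity-space embedding with (optional) representative selection can be sketched as follows; setting the binary selector to all ones reproduces the "no representative selection" configuration used in the comparison. All names and values are illustrative.

```python
# Sketch of the dissimilarity-space embedding: a pattern is represented by
# its dissimilarities to a set of prototypes, and a binary vector (part of
# the MKMD genetic code) keeps only the selected columns. An all-ones
# vector disables the selection, as in the comparison against OCC.

def embed(x, prototypes, selector, dissimilarity):
    """Dissimilarity-space vector of x w.r.t. the selected prototypes."""
    return [dissimilarity(x, p)
            for p, keep in zip(prototypes, selector) if keep]

prototypes = [0.0, 2.0, 4.0, 8.0]
diss = lambda a, b: abs(a - b)

print(embed(3.0, prototypes, [1, 1, 1, 1], diss))  # full embedding: [3.0, 1.0, 1.0, 5.0]
print(embed(3.0, prototypes, [1, 0, 0, 1], diss))  # prototypes 1 and 4 only: [3.0, 5.0]
```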
Table 5

Test set performances with the one-class classifier.

Class | Classifier | Accuracy | Precision | Recall | Informedness | AUC
1 (EC1) | OCC | 0.92 | 0.97 | 0.35 | 0.67 | 0.85
1 (EC1) | MKMD | 0.95 | 0.88 | 0.67 | 0.83 | 0.91
2 (EC2) | OCC | 0.83 | 0.87 | 0.45 | 0.69 | 0.76
2 (EC2) | MKMD | 0.91 | 0.89 | 0.66 | 0.82 | 0.91
3 (EC3) | OCC | 0.83 | 0.86 | 0.49 | 0.70 | 0.77
3 (EC3) | MKMD | 0.90 | 0.84 | 0.57 | 0.77 | 0.88
4 (EC4) | OCC | 0.68 | 0.60 | 0.78 | 0.61 | 0.72
4 (EC4) | MKMD | 0.97 | 0.89 | 0.53 | 0.76 | 0.87
5 (EC5) | OCC | 0.85 | 0.75 | 0.37 | 0.62 | 0.69
5 (EC5) | MKMD | 0.98 | 0.82 | 0.44 | 0.72 | 0.78
6 (EC6) | OCC | 0.97 | 0.96 | 0.57 | 0.78 | 0.88
6 (EC6) | MKMD | 0.99 | 0.92 | 0.77 | 0.88 | 0.95
7 (not-enzymes) | OCC | 0.68 | 0.60 | 0.78 | 0.61 | 0.72
7 (not-enzymes) | MKMD | 0.82 | 0.78 | 0.68 | 0.79 | 0.88

Normalised.
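The per-class figures of the kind reported in Table 5 follow from the confusion counts; in particular, informedness is recall plus specificity minus one (Youden's J). A minimal sketch, on illustrative counts rather than the actual test-set confusion matrices:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and informedness from confusion counts.
    Informedness (Youden's J) = recall + specificity - 1."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    return accuracy, precision, recall, recall + specificity - 1.0

# Illustrative counts (not taken from the experiments):
acc, prec, rec, inf = metrics(tp=35, fp=1, tn=57, fn=7)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(inf, 2))  # 0.92 0.97 0.83 0.82
```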

Figure 5

ROC curves comparison (best run for all classes). In brackets, the respective AUC values.

From Table 5 it is evident that MKMD outperforms OCC in terms of accuracy, informedness and AUC (see also the ROC curves in Figure 5a,b), but a clear winner does not exist as regards precision and recall. As regards structural complexity, OCC is bounded by the number of clusters k, whereas MKMD is bounded by the number of support vectors returned by the training phase [24]. Indeed, the computational burden required to classify new test data is given by:

- the pairwise distances between the test data and the k cluster centres (for OCC);
- the dot products between the test data and the support vectors (for MKMD).

Specifically, for OCC, a suitable number of 120 clusters has been defined for all classes, whereas the training phase for MKMD returned an average of 1300 support vectors (~52% of the training data) for class 1, 1881 (~76%) for class 2, 1745 (~70%) for class 3, 1213 (~49%) for class 4, 767 (~31%) for class 5, 864 (~35%) for class 6 and 1945 (~78%) for class 7. In conclusion, whilst MKMD outperforms OCC in terms of performance, the latter outperforms the former in terms of structural complexity.
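The structural-complexity gap above can be made concrete by counting the dominant operations per test pattern: k dissimilarity evaluations for OCC versus one dot product per support vector for MKMD. The counts are those reported in the text; treating a dissimilarity evaluation and a dot product as comparably expensive is an assumption made purely for illustration.

```python
# Per-pattern operation counts: OCC evaluates k dissimilarities (one per
# cluster centre), MKMD one dot product per support vector. Counts are
# the ones reported in the text; per-operation costs are assumed equal.

k_occ = 120  # clusters per class (OCC)
support_vectors = {  # per-class support vectors returned by MKMD training
    1: 1300, 2: 1881, 3: 1745, 4: 1213, 5: 767, 6: 864, 7: 1945,
}

for cls, n_sv in support_vectors.items():
    print(f"class {cls}: MKMD/OCC cost ratio ~ {n_sv / k_occ:.1f}x")
```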

4.5. Comparing against Previous Works

Table 6 reports the performances (in terms of AUC only, for the sake of brevity) of the proposed MKMD approach with fitness function f1 (Table 3), with fitness function f2 (Table 4) and with no representative selection in the embedding space (Table 5) against our previous studies on the same classification problem. For the sake of completeness, the results obtained by OCC (Table 5) are also included.
Table 6

Comparison (in terms of AUC) between the proposed MKMD approach and previous studies.

Approach | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | Not-Enzymes
DME + Logistic Regression [44] | 0.62
DME + SVM [44] | 0.64
DME + Naïve Bayes [44] | 0.62
DME + Decision Tree [44] | 0.60
DME + Neural Network [44] | 0.63
OCC [44] | 0.63
Feature Generation via Betti Numbers + SVM [46] | 0.79 | 0.75 | 0.73 | 0.73 | 0.46 | 0.77 | 0.77
Feature Generation via Spectral Density + SVM [45] | 0.85 | 0.82 | 0.85 | 0.81 | 0.59 | 0.81 | 0.82
MKMD with f1 (Table 3) | 0.92 | 0.90 | 0.88 | 0.88 | 0.78 | 0.95 | 0.89
MKMD with f2 (Table 4) | 0.92 | 0.90 | 0.87 | 0.88 | 0.79 | 0.95 | 0.88
MKMD with no representative selection (Table 5) | 0.91 | 0.91 | 0.88 | 0.87 | 0.78 | 0.95 | 0.88
OCC (Table 5) | 0.85 | 0.76 | 0.77 | 0.72 | 0.69 | 0.88 | 0.72
Entries from [44] report a single AUC value, as those experiments addressed the binary enzymes vs. not-enzymes problem.
In [44], two experiments were performed: the first relied on Dissimilarity Matrix Embedding (DME) considering different protein representations (similar to the ones considered in this work), and the second relied on an OCC able to explore those different representations simultaneously (as in this work). There are three main differences between this work and [44]: first, the set of representations is different; second, in [44] we only addressed the binary classification problem between enzymes and not-enzymes; third, the set of considered proteins is different. In fact, in [44] we performed an additional filtering stage in order to select (for the same UniProt ID) only the PDB entry with the best resolution: we found that this heavily limits the number of protein samples available, possibly reducing the learning capabilities. In [45,46] we used the sampled spectral density of the protein contact networks (more information can be found in Appendix A.8) and the Betti numbers (more information can be found in Appendix A.1), respectively: the results in Table 6 feature the same protein set used in this work. Indeed, thanks to the observation above, experiments have been repeated with an augmented number of protein samples [90,91]. The results in Table 6 highlight that:

- avoiding filtering out PDB structures by considering only the best resolution for a given UniProt ID (as carried out also in this work) helps in improving classification models: indeed, the performances from [44] are amongst the lowest;
- the proposed MKMD approach, regardless of the fitness function and/or representative selection, outperforms all competitors for all EC classes (including not-enzymes).

4.6. On the Knowledge Discovery Phase

Apart from the good generalisation capabilities, it is worth remarking that an interesting aspect of the proposed multiple kernel approach is the two-fold knowledge discovery phase:

- by analysing the kernel weights, it is possible to determine the most important representations for the problem at hand;
- by analysing the binary vector in charge of selecting prototypes from the dissimilarity space, it is possible to determine and analyse the patterns (proteins, in this case) elected as prototypes.

Let us start our discussion from the latter point. From a chemical viewpoint, proteins are linear hetero-polymers in the form of non-periodic sequences of 20 different monomers (amino-acid residues). While artificial (periodic) polymers are very large extended molecules forming a matrix, the majority of proteins fold as self-contained water-soluble structures. Thus, we can consider the particular linear arrangement of amino-acid residues as a sort of 'recipe' for making a water-soluble polymer with a well-defined three-dimensional architecture [92]. "Well-defined three-dimensional structure" should not be understood as a 'fixed architecture': many proteins appear as partially or even totally disordered when analysed with spectroscopic methods. This apparent disorder corresponds to an efficient organisation with respect to the protein's physiological role, giving the molecule the possibility to adapt to rapidly changing microenvironment conditions [93]. This implies that the two main drivers of the 3D arrangement of amino-acid residues (from which the particular properties of the relative contact networks derive) are:

- to efficiently accomplish the task of being water soluble while maintaining a stable structure (or dynamics);
- to allow for an efficient spreading of the signal across the amino-acid residue contact network, so as to sense relevant microenvironment changes and to reshape accordingly (allosteric effect, see [94]).
Currently, we have only a coarse-grained knowledge of such complex tasks, and biochemists are still very far from being able to reproduce this behaviour with synthetic constructs. The ability to catalyse a specific class of chemical reactions (the property the EC classification is based upon), while crucial for the biological role of protein molecules, is, from the point of view of topological and geometrical protein structure, only a very minor modulation of their global shape [92]. Notwithstanding that, a thorough analysis of the representative (thus pivotal for discrimination) proteins can give us some general hints, not confined to the specific classification task but extending to all 'hard' classification problems based upon very tiny details of the statistical units. Looking at the representative proteins (hence endowed with meaningful discriminative power) in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7 (Appendix B), we immediately note that the pivotal proteins come from all the analysed EC categories and not only from the specific class to be discriminated. This is expected given the absence of a simple form-function relation; hence, they can be considered an 'emergent property' of the discrimination task. The presence of molecules of different classes that are crucial for modelling a specific category suggests a peculiar strategy adopted by the system, analogous to the use of 'paired samples' in statistical investigation [95,96]. When only minor details discriminate statistical units pertaining to different categories, the only possibility is to adopt a paired-samples strategy, in which each element of a category is paired with a very similar example of another category, so as to rely on their differences (on a sample-by-sample basis) instead of looking for general 'class-specific' properties.
This is the case of proteins, whose general shape is only partially determined by the chemical reaction they catalyse: looking at the 3D structures of the relevant proteins, we can easily verify that they fall into three basic patterns (Figure 6):
Table A1

Selected proteins in order to discriminate EC 1 (oxidoreductases) vs. all the rest.

PDB ID | Notes/Description
1KOF | Transferase
1XFG | Transferase
3E2R | Oxidoreductase
4TS9 | Transferase
1ZDM | Signalling protein
1MPG | Hydrolase
1QQQ | Transferase
Table A2

Selected proteins in order to discriminate EC 2 (transferases) vs. all the rest.

PDB ID | Notes/Description
3EDC | LAC repressor (signalling protein)
1DKL | Hydrolase
1JKJ | Ligase
2DBI | Unknown function
3UCS | Chaperone
1LX7 | Transferase
2GAR | Transferase
3ILI | Transferase
1S08 | Transferase
4IXM | Hydrolase
4XTJ | Isomerase
1KW1 | Lyase
1BDH | Transcription factor (DNA-binding)
4PC3 | Elongation factor (RNA-binding)
5G1L | Isomerase
Table A3

Selected proteins in order to discriminate EC 3 (hydrolases) vs. all the rest.

PDB ID | Notes/Description
4RZS | Transcription factor (signalling protein)
1ZDM | Signalling protein
3I7R | Lyase
1HW5 | Signalling protein
1SO5 | Lyase
Table A4

Selected proteins in order to discriminate EC 4 (lyases) vs. all the rest.

PDB ID | Notes/Description
2BWX | Hydrolase
3UWM | Oxidoreductase
2H71 | Electron transport
1D7A | Lyase
4DAP | DNA-binding
1SPV | Structural genomics, unknown function
1EXD | Ligase + RNA-binding
1X83 | Isomerase
3ILJ | Transferase
2D4U | Signalling protein
1JNW | Oxidoreductase
1TRE | Oxidoreductase
1ZPT | Oxidoreductase
3LGU | Hydrolase
1IB6 | Oxidoreductase
3C0U | Structural genomics, unknown function
5GT2 | Oxidoreductase
2RN2 | Hydrolase
4L4Z | Transcription regulator
3CMR | Hydrolase
1NQF | Transport protein
1GPQ | Hydrolase
4ODM | Isomerase + chaperone
2NPG | Transport protein
2UAG | Ligase
1OVG | Transferase
3AVU | Transferase
1RBV | Hydrolase
5AB1 | Cell adhesion
1TMM | Transferase
4NIY | Hydrolase
4WR3 | Isomerase
Table A5

Selected proteins in order to discriminate EC 5 (isomerases) vs. all the rest.

PDB ID | Notes/Description
4ITX | Lyase
2BWW | Hydrolase
5IU6 | Transferase
1ODD | Gene regulatory
5G5G | Oxidoreductase
1G7X | Transferase
2E0Y | Transferase
2SCU | Ligase
1HO4 | Hydrolase
3RGM | Transport protein
1OAC | Oxidoreductase
5MUC | Oxidoreductase
3OGD | Hydrolase + DNA-binding
4K34 | Membrane protein
1Q0L | Oxidoreductase
1G58 | Isomerase
5M3B | Transport protein
2WOH | Oxidoreductase
2PJP | Translation regulation (RNA-binding)
Table A6

Selected proteins in order to discriminate EC 6 (ligases) vs. all the rest.

PDB ID | Notes/Description
2OLQ | Lyase
1JDI | Isomerase
4NIG | Oxidoreductase + DNA-binding
5T03 | Transferase
5FNN | Oxidoreductase
2Z9D | Oxidoreductase
2V3Z | Hydrolase
4ARI | Ligase + RNA-binding
3LBS | Transport protein
4QGS | Oxidoreductase
5B7F | Oxidoreductase
2ABH | Transferase
Table A7

Selected proteins in order to discriminate not-enzymes vs. all the rest.

PDB ID | Notes/Description
1SPA | Transferase
2YH9 | Membrane protein
1NQF | Transport protein
1LDI | Transport protein
1TIK | Hydrolase
1MWI | Hydrolase + DNA-binding
1GEW | Transferase
5CKH | Hydrolase
3ABQ | Lyase
3B6M | Oxidoreductase
Figure 6

Three basic patterns in protein 3D structures. (a) Transferase—PDB ID 1KOF, (b) Proline dehydrogenase (oxidoreductase)—PDB ID 3E2R, (c) Transport Protein (Non-Enzyme)—PDB ID 3RGM.

- a cyclic pattern with an approximately spherical symmetry (Figure 6a);
- a globular pattern with 'duplication': the protein can be considered as two identical half-structures (Figure 6b);
- an elongated non-cyclic pattern, typical of membrane-bound proteins (Figure 6c).

Even if the three above-mentioned patterns have slightly different relative frequencies in the EC classes (e.g., pattern 3 is more frequent in non-enzymatic proteins), they are present in all the analysed classes, thus allowing for the 'between-categories' sample-by-sample pairing mentioned above. This peculiar situation is in line with current biochemical knowledge (minimal effect exerted by the catalysed reaction on the global structure) and it is a relevant proof-of-concept of both the reliability of the classification solution and the power of the proposed approach. On the other hand, it is very hard to de-convolve the discriminating structural nuances from the obtained solution which, as it is, only confirms the presence of 'tiny and still unknown' structural details linked to the catalytic activity of the studied molecules. As regards the former point, Figure 7 shows the average weights vector across the aforementioned five runs, showing that the MKMD approach considers, for almost all classes, centrality measures and the protein size as the most relevant representations, followed by the Betti numbers sequence, heat content invariants and heat kernel trace.
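The role of the kernel weights can be sketched with a toy weighted combination of per-representation kernels, each defined over a dissimilarity value; taking exp(-d) as the base kernel is an assumption for illustration, as are the weights and dissimilarities below.

```python
import math

def combined_kernel(d_per_repr, weights):
    """Linear combination of per-representation kernels, each defined
    over a dissimilarity value; each base kernel is taken here as
    exp(-d), an assumed (common) choice for dissimilarity-based kernels."""
    return sum(w * math.exp(-d) for w, d in zip(weights, d_per_repr))

# Two patterns compared under three representations (their dissimilarities):
d = [0.1, 2.0, 0.5]
# Up-weighting a representation makes it dominate the overall similarity,
# which is why large average weights flag the most relevant representations:
print(round(combined_kernel(d, [0.8, 0.1, 0.1]), 3))  # first repr. dominant
print(round(combined_kernel(d, [0.1, 0.8, 0.1]), 3))  # second repr. dominant
```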
Figure 7

Average kernel weights vectors.

It is worth noting that enzymes have a more pronounced allosteric effect with respect to non-enzymatic structures. This is a consequence of the need to modulate chemical kinetics according to microenvironment conditions: allostery is the modulating effect, on the efficiency of the reaction, of a modification happening at a site different from the catalytic site [97]. Allostery implies an efficient transport of the signal along the protein structure, and it was discovered to be efficiently interpreted in terms of PCN descriptors [98]; thus, the observed kernel weights fit well with current biochemical knowledge.

5. Conclusions

In this paper, we proposed a classification system able to simultaneously explore multiple representations by hybridising multiple kernel learning and dissimilarity spaces, hence exploiting the discriminative power of kernel methods and the customisability of dissimilarity spaces. Specifically, several representations are treated via their respective dissimilarity representations and combined in a multiple kernel fashion, where each kernel function considers a specific dissimilarity representation. A genetic algorithm (although any derivative-free evolutionary metaheuristic could be used instead) simultaneously selects suitable representatives in the dissimilarity space and tunes the kernel weights, allowing a two-fold a posteriori knowledge discovery phase regarding the most suitable representations (higher kernel weights) and the patterns elected as prototypes in the dissimilarity space. The proposed MKMD system has been applied to a real-world problem, namely protein function prediction, with satisfactory results, greatly outperforming our previous works in which graph-based descriptors extracted from PCNs were tested on the very same problem. Further, the proposed system has been benchmarked against a One-Class Classifier also able to simultaneously explore multiple dissimilarities: whilst the former outperforms the latter in terms of accuracy, AUC and informedness, a clear winner between the two methods does not exist in terms of precision and recall. As far as the two-fold knowledge discovery phase for the proposed application is concerned, results both in terms of representatives selected in the dissimilarity space and weights automatically assigned to the different representations are in line with current biological knowledge, showing the reliability of the proposed system.

Furthermore, due to its flexibility, the proposed system can be applied to any input domain (not necessarily graphs), provided that several representations can be extracted from the structured data at hand and that suitable dissimilarity measures can be defined for such heterogeneous representations.