Ludwig Lausser, Robin Szekely, Attila Klimmek, Florian Schmid, Hans A. Kestler.
Abstract
Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Also, improper reference point choice and scaling pose additional challenges. Often model selection is somewhat guided by ad hoc simulations rather than by sophisticated considerations on the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We can further show that these concept classes also form a half-order of complexity classes in terms of Vapnik-Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.
Keywords: classification; computational learning theory; invariances; molecular profiles
Year: 2020 PMID: 32019472 PMCID: PMC7061712 DOI: 10.1098/rsif.2019.0612
Source DB: PubMed Journal: J R Soc Interface ISSN: 1742-5662 Impact factor: 4.118
Figure 1. Invariant subclasses of linear classifiers. Linear classifiers can be organized in a hierarchy of four structural subgroups that imply different invariances. Each invariance counteracts the effects of a specific type of data transformation and preserves the predictions of the corresponding classification models. Some of these invariances can also be transferred to univariate predictors. This half-order is also reflected by a decrease in the Vapnik–Chervonenkis dimension from top to bottom, implying increased generalization ability. (Online version in colour.)
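The scaling invariance in this hierarchy can be verified directly from the decision rule. A minimal sketch (not the authors' code; variable names are mine): an offset-free linear classifier sign(⟨w, x⟩) keeps its prediction under a global scaling x → λx with λ > 0, since sign(λ⟨w, x⟩) = sign(⟨w, x⟩).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)   # one hypothetical molecular profile
w = rng.normal(size=20)   # weight vector of the classifier
lam = 3.7                 # positive global scaling factor

def linear(x, w, b=0.0):
    """Linear decision rule sign(<w, x> - b); b = 0 gives the offset-free case."""
    return np.sign(w @ x - b)

# the offset-free prediction is preserved under positive global scaling
assert linear(lam * x, w) == linear(x, w)
```

A classifier with a non-zero offset b does not enjoy this guarantee: sign(λ⟨w, x⟩ − b) can flip once λ moves ⟨w, x⟩ across b/λ.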
Figure 2. Structural properties of invariant linear classifiers: the first row gives examples of general linear classifiers; the second row gives examples of the invariant concept classes. Each column provides a dataset that is affected by a specific type of data transformation. From left to right, the datasets are affected by global scaling, global transition and the combination thereof. Data points that receive a different class label due to the data transformation are marked by a grey halo. (Online version in colour.)
Overview of the discussed subclasses of linear classifiers. The concept classes are reported by their name, their structural properties, their invariances and their requirements on available measurements.
| name | structural properties | invariant to | required features |
|---|---|---|---|
| (standard) linear classifier | sign(⟨w, x⟩ − b) | — | [1; n] |
| single threshold classifier | sign(x_i − b) | — | 1 |
| offset-free linear classifier | b = 0 | global scaling | [1; n] |
| offset-free single threshold classifier | b = 0 | global scaling | 1 |
| linear contrast classifier | Σ_i w_i = 0 | global transition | [2; n] |
| offset-free linear contrast classifier | b = 0, Σ_i w_i = 0 | global scaling and transition | [2; n] |
| pairwise comparison | sign(x_i − x_j) | global scaling and transition | 2 |
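The transition-related rows of the table follow from the zero-sum weight constraint: ⟨w, x + c·1⟩ = ⟨w, x⟩ + c·Σ_i w_i = ⟨w, x⟩. A short sketch under my own variable names (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
w = rng.normal(size=10)
w -= w.mean()                     # enforce sum(w) == 0 (linear contrast)
c, lam = 2.5, 4.0                 # global transition / positive scaling

# a global transition x -> x + c*1 cancels out of the contrast score
assert np.isclose(w @ (x + c), w @ x)

# a pairwise comparison sign(x_i - x_j) needs only two features and is
# invariant to positive scaling and transition combined:
# sign(lam*x_i + c - (lam*x_j + c)) = sign(lam*(x_i - x_j)) = sign(x_i - x_j)
i, j = 0, 1
z = lam * x + c
assert np.sign(x[i] - x[j]) == np.sign(z[i] - z[j])
```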
Summary of the analysed experiments on artificial datasets.

| noise-free experiments | |
|---|---|
| concept classes: | |
| training algorithms: | R2-SVM, R1-SVM |
| dimensionality: | n ∈ {2, 10, 100} |
| samples: | |
| distance of centroids: | |
| repetitions: | 10 per centroid distance |
| number of experiments: | 123 000 |

| noise experiments | |
|---|---|
| concept classes: | |
| training algorithms: | R2-SVM, R1-SVM |
| noise types: | none; scaling; transition; scaling and transition; exponential |
| noise parameter: | p ∈ {0, …, 5} |
| dimensionality: | n ∈ {2, 10, 100} |
| samples: | |
| distance of centroids: | |
| repetitions: | r ∈ {1, …, 10} |
| number of experiments: | 18 000 |
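The artificial-data setup above (two classes at a given centroid distance, then sample-wise transformations) can be sketched as follows. This is a hypothetical reconstruction: the paper's exact generator and noise parameterizations are not given in this summary, so the formulas inside `transform` are assumptions.

```python
import numpy as np

def make_dataset(n_dim, n_samples, d, rng):
    """Two Gaussian classes whose centroids lie d apart (assumed setup)."""
    direction = np.zeros(n_dim)
    direction[0] = 1.0
    X0 = rng.standard_normal((n_samples, n_dim))
    X1 = rng.standard_normal((n_samples, n_dim)) + d * direction
    return np.vstack([X0, X1]), np.repeat([0, 1], n_samples)

def transform(X, kind, p, rng):
    """Sample-wise versions of the five listed noise types.

    The parameterizations below are plausible readings, not the paper's."""
    if kind == "none":
        return X.copy()
    if kind == "scaling":
        return X * (1.0 + p * rng.random(len(X)))[:, None]
    if kind == "transition":
        return X + (p * rng.standard_normal(len(X)))[:, None]
    if kind == "scaling and transition":
        return transform(transform(X, "scaling", p, rng), "transition", p, rng)
    if kind == "exponential":
        return np.exp(p * X)   # one reading of "exponential" (assumed)
    raise ValueError(kind)
```

For example, `make_dataset(100, 50, 3.0, rng)` yields a 100-dimensional problem with 50 samples per class, to which each noise type can then be applied at increasing strength p.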
Summary of the transcriptome microarray and RNA-Seq datasets used. The classes, class-wise sample sizes and numbers of features are shown.
| id | tissue | class labels | samples | features |
|---|---|---|---|---|
| | bone marrow | acute myeloid leukaemia (AML), mutated AML | 21, 57 | 22 215 |
| | breast | non-inflammatory, inflammatory | 69, 26 | 22 215 |
| | bladder | Ta, T1 | 20, 20 | 7129 |
| | tongue | normal mucosa, oral tongue squamous cell carcinoma | 26, 31 | 12 558 |
| | soft tissue | dedifferentiated liposarcoma, well-differentiated liposarcoma | 40, 52 | 22 215 |
| | lymph node | intermediate, monoclonal B-cell lymphocytosis | 48, 44 | 22 215 |
| | brain | healthy, schizophrenia | 15, 13 | 12 558 |
| | kidney | non-tumour kidney tissue, renal cell carcinoma (RCC) | 23, 69 | 22 215 |
| | brain | inbred alcohol-preferring, inbred alcohol-non-preferring | 29, 30 | 8740 |
| | head and neck | normal mucosa, head and neck squamous cell carcinoma | 22, 22 | 12 558 |
| | lung | normal tissue, adenocarcinoma | 49, 58 | 22 215 |
| | lung | adenocarcinoma, squamous cell carcinoma | 14, 18 | 12 558 |
| | blood | healthy, severe asthma | 18, 17 | 32 321 |
| | blood | diffuse large B-cell lymphoma, follicular lymphoma | 19, 58 | 7129 |
| | prostate | non-tumour prostate tissue, prostate tumour | 50, 52 | 12 558 |
| | intestinal mucosa | non-cystic fibrosis, cystic fibrosis | 13, 16 | 22 215 |
| | fibroblasts | healthy, macular degeneration | 18, 18 | 12 558 |
| | prostate | non-recurrent cancer, recurrent cancer | 40, 39 | 22 215 |
| | colon | microsatellite instable tumour, microsatellite stable tumour | 13, 38 | 7071 |
| | stomach | non-cardia tumour tissue, cardia tumour tissue | 72, 62 | 22 215 |
| | stomach | normal gastric glands, tumour tissue | 134, 134 | 22 215 |
| | skin | melanoma, metastasis | 25, 24 | 22 215 |
| | TCGA RNA-Seq | | | |
| | kidney | chromophobe RCC (ChRCC), clear cell RCC (CCRCC) | 91, 606 | 20 655 |
| | kidney | ChRCC, papillary RCC (PRCC) | 91, 323 | 20 632 |
| | kidney | CCRCC, PRCC | 606, 323 | 20 684 |
| | bile duct, pancreas | cholangiocarcinoma, pancreatic cancer | 45, 183 | 20 439 |
| | liver, pancreas | HCC, pancreatic cancer | 424, 183 | 20 657 |
Figure 3. Evaluation of experiments on artificial datasets: the accuracy differences between SVMlin and the invariant SVMs in noise-free experiments are shown. The rows show the different invariant classifiers. The columns provide the dimensionality of the underlying datasets n = {2, 10, 100}. The experiments are ordered by ascending distance of the class centroids d (x-axis). The y-axis provides the accuracy difference; a positive value denotes a higher accuracy of SVMlin. For each value of d, 10 experiments with different class centroids are shown.
Figure 4. Accuracies achieved under the influence of data transformations: the figure provides the results of noise experiments with invariant R2-SVMs and R1-SVMs on artificial datasets. (a) The effects of sample-wise data transformations on the test samples. (b) The influence of distinct class-wise data transformations for training and test samples. The results are organized in blocks (from left to right), which correspond to the types of applied data transformations. Each column provides the results of a subclass of invariant classifiers. The rows give the dimensionality of the data n = {2, 10, 100}. Each box contains the results of 10 repetitions r ∈ {1, …, 10} and six increasing noise parameters p ∈ {0, …, 5}. (Online version in colour.)
Figure 5. Results of 10 × 10 cross-validation experiments for transcriptome data: the mean accuracy is shown for the five concept classes of linear support vector machines (R2 and R1), for kNN with k ∈ {1, 3, 5}, for random forests with nt ∈ {100, 200, 300} trees and for stacked auto-encoders (SAE) with u ∈ {100, 500, 1000} units. Baseline denotes the performance of the classifier that always chooses the larger class. (Online version in colour.)
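The 10 × 10 cross-validation protocol and the baseline of Figure 5 can be sketched generically as follows. This is my own minimal reconstruction, not the authors' evaluation code; `fit_predict` stands in for any of the compared classifiers.

```python
import numpy as np

def cross_validate(X, y, fit_predict, n_runs=10, n_folds=10, seed=0):
    """10 x 10 cross-validation: 10 random fold assignments,
    each split into 10 folds; returns the mean test accuracy."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        folds = rng.permutation(len(y)) % n_folds   # equal-sized folds
        for f in range(n_folds):
            test = folds == f
            pred = fit_predict(X[~test], y[~test], X[test])
            accs.append(float(np.mean(pred == y[test])))
    return float(np.mean(accs))

def baseline(X_tr, y_tr, X_te):
    """The figure's baseline: always predict the larger training class."""
    return np.full(len(X_te), np.bincount(y_tr).argmax())
```

On a dataset with a 70/30 class split, the baseline's cross-validated accuracy equals the larger class fraction, which is why it serves as the floor that the SVMs, kNN, random forests and SAEs are compared against.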