Ludwig Lausser, Robin Szekely, Attila Klimmek, Florian Schmid, Hans A. Kestler.
Abstract
Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Also, improper reference point choice and scaling pose additional challenges. Often model selection is somewhat guided by ad hoc simulations rather than by sophisticated considerations on the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We can further show that these concept classes also form a half-order of complexity classes in terms of Vapnik-Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.
Keywords: classification; computational learning theory; invariances; molecular profiles
Year: 2020 PMID: 32019472 PMCID: PMC7061712 DOI: 10.1098/rsif.2019.0612
Source DB: PubMed Journal: J R Soc Interface ISSN: 1742-5662 Impact factor: 4.118
Figure 1. Invariant subclasses of linear classifiers. Linear classifiers can be organized in a hierarchy of four structural subgroups that imply different invariances. Each invariance counteracts the effects of a specific type of data transformation and preserves the predictions of the corresponding classification models. Some of these invariances can also be transferred to univariate predictors. This half-order is also reflected by a decrease in the Vapnik–Chervonenkis dimension from top to bottom, implying increased generalization ability. (Online version in colour.)
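The scaling invariance in this hierarchy can be verified directly from the decision rule. A minimal sketch (not the authors' code; variable names are mine): an offset-free linear classifier sign(⟨w, x⟩) keeps its prediction under a global scaling x → λx with λ > 0, since sign(λ⟨w, x⟩) = sign(⟨w, x⟩).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)   # one hypothetical molecular profile
w = rng.normal(size=20)   # weight vector of the classifier
lam = 3.7                 # positive global scaling factor

def linear(x, w, b=0.0):
    """Linear decision rule sign(<w, x> - b); b = 0 gives the offset-free case."""
    return np.sign(w @ x - b)

# the offset-free prediction is preserved under positive global scaling
assert linear(lam * x, w) == linear(x, w)
```

A classifier with a non-zero offset b does not enjoy this guarantee: sign(λ⟨w, x⟩ − b) can flip once λ moves ⟨w, x⟩ across b/λ.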
Figure 2. Structural properties of invariant linear classifiers: the first row gives examples of general linear classifiers; the second row gives examples of the invariant concept classes. Each column provides a dataset that is affected by a specific type of data transformation. From left to right, the datasets are affected by global scaling, global transition and the combination thereof. Data points that receive a different class label due to the data transformation are marked by a grey halo. (Online version in colour.)
Overview of the discussed subclasses of linear classifiers. The concept classes are reported by their name, their structural properties, their invariances and their requirements on available measurements.
| name | structural properties | invariant to | required features |
|---|---|---|---|
| (standard) linear classifier | sign(⟨w, x⟩ − b) | — | [1; n] |
| single threshold classifier | sign(x_i − b) | — | 1 |
| offset-free linear classifier | b = 0 | global scaling | [1; n] |
| offset-free single threshold classifier | b = 0 | global scaling | 1 |
| linear contrast classifier | Σ_i w_i = 0 | global transition | [2; n] |
| offset-free linear contrast classifier | b = 0, Σ_i w_i = 0 | global scaling and transition | [2; n] |
| pairwise comparison | sign(x_i − x_j) | global scaling and transition | 2 |
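The transition-related rows of the table follow from the zero-sum weight constraint: ⟨w, x + c·1⟩ = ⟨w, x⟩ + c·Σ_i w_i = ⟨w, x⟩. A short sketch under my own variable names (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
w = rng.normal(size=10)
w -= w.mean()                     # enforce sum(w) == 0 (linear contrast)
c, lam = 2.5, 4.0                 # global transition / positive scaling

# a global transition x -> x + c*1 cancels out of the contrast score
assert np.isclose(w @ (x + c), w @ x)

# a pairwise comparison sign(x_i - x_j) needs only two features and is
# invariant to positive scaling and transition combined:
# sign(lam*x_i + c - (lam*x_j + c)) = sign(lam*(x_i - x_j)) = sign(x_i - x_j)
i, j = 0, 1
z = lam * x + c
assert np.sign(x[i] - x[j]) == np.sign(z[i] - z[j])
```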
Summary of the analysed experiments on artificial datasets.

| noise-free experiments | |
|---|---|
| concept classes: | |
| training algorithms: | R2-SVM, R1-SVM |
| dimensionality: | n ∈ {2, 10, 100} |
| samples: | |
| distance of centroids: | |
| repetitions: | 10 per centroid distance |
| number of experiments: | 123 000 |

| noise experiments | |
|---|---|
| concept classes: | |
| training algorithms: | R2-SVM, R1-SVM |
| noise types: | none; scaling; transition; scaling and transition; exponential |
| noise parameter: | p ∈ {0, …, 5} |
| dimensionality: | n ∈ {2, 10, 100} |
| samples: | |
| distance of centroids: | |
| repetitions: | r ∈ {1, …, 10} |
| number of experiments: | 18 000 |
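The artificial-data setup above (two classes at a given centroid distance, then sample-wise transformations) can be sketched as follows. This is a hypothetical reconstruction: the paper's exact generator and noise parameterizations are not given in this summary, so the formulas inside `transform` are assumptions.

```python
import numpy as np

def make_dataset(n_dim, n_samples, d, rng):
    """Two Gaussian classes whose centroids lie d apart (assumed setup)."""
    direction = np.zeros(n_dim)
    direction[0] = 1.0
    X0 = rng.standard_normal((n_samples, n_dim))
    X1 = rng.standard_normal((n_samples, n_dim)) + d * direction
    return np.vstack([X0, X1]), np.repeat([0, 1], n_samples)

def transform(X, kind, p, rng):
    """Sample-wise versions of the five listed noise types.

    The parameterizations below are plausible readings, not the paper's."""
    if kind == "none":
        return X.copy()
    if kind == "scaling":
        return X * (1.0 + p * rng.random(len(X)))[:, None]
    if kind == "transition":
        return X + (p * rng.standard_normal(len(X)))[:, None]
    if kind == "scaling and transition":
        return transform(transform(X, "scaling", p, rng), "transition", p, rng)
    if kind == "exponential":
        return np.exp(p * X)   # one reading of "exponential" (assumed)
    raise ValueError(kind)
```

For example, `make_dataset(100, 50, 3.0, rng)` yields a 100-dimensional problem with 50 samples per class, to which each noise type can then be applied at increasing strength p.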
Summary of the transcriptome microarray and RNA-Seq datasets used. The classes, class-wise sample sizes and numbers of features are shown.
| id | tissue | class labels | samples | features |
|---|---|---|---|---|
| | bone marrow | acute myeloid leukaemia (AML), mutated AML | 21, 57 | 22 215 |
| | breast | non-inflammatory, inflammatory | 69, 26 | 22 215 |
| | bladder | Ta, T1 | 20, 20 | 7129 |
| | tongue | normal mucosa, oral tongue squamous cell carcinoma | 26, 31 | 12 558 |
| | soft tissue | dedifferentiated liposarcoma, well-differentiated liposarcoma | 40, 52 | 22 215 |
| | lymph node | intermediate, monoclonal B-cell lymphocytosis | 48, 44 | 22 215 |
| | brain | healthy, schizophrenia | 15, 13 | 12 558 |
| | kidney | non-tumour kidney tissue, renal cell carcinoma (RCC) | 23, 69 | 22 215 |
| | brain | inbred alcohol-preferring, inbred alcohol-non-preferring | 29, 30 | 8740 |
| | head and neck | normal mucosa, head and neck squamous cell carcinoma | 22, 22 | 12 558 |
| | lung | normal tissue, adenocarcinoma | 49, 58 | 22 215 |
| | lung | adenocarcinoma, squamous cell carcinoma | 14, 18 | 12 558 |
| | blood | healthy, severe asthma | 18, 17 | 32 321 |
| | blood | diffuse large B-cell lymphoma, follicular lymphoma | 19, 58 | 7129 |
| | prostate | non-tumour prostate tissue, prostate tumour | 50, 52 | 12 558 |
| | intestinal mucosa | non-cystic fibrosis, cystic fibrosis | 13, 16 | 22 215 |
| | fibroblasts | healthy, macular degeneration | 18, 18 | 12 558 |
| | prostate | non-recurrent cancer, recurrent cancer | 40, 39 | 22 215 |
| | colon | microsatellite instable tumour, microsatellite stable tumour | 13, 38 | 7071 |
| | stomach | non-cardia tumour tissue, cardia tumour tissue | 72, 62 | 22 215 |
| | stomach | normal gastric glands, tumour tissue | 134, 134 | 22 215 |
| | skin | melanoma, metastasis | 25, 24 | 22 215 |
| | TCGA RNA-Seq | | | |
| | kidney | chromophobe RCC (ChRCC), clear cell RCC (CCRCC) | 91, 606 | 20 655 |
| | kidney | ChRCC, papillary RCC (PRCC) | 91, 323 | 20 632 |
| | kidney | CCRCC, PRCC | 606, 323 | 20 684 |
| | bile duct, pancreas | cholangiocarcinoma, pancreatic cancer | 45, 183 | 20 439 |
| | liver, pancreas | HCC, pancreatic cancer | 424, 183 | 20 657 |
Figure 3. Evaluation of experiments on artificial datasets: the accuracy differences between SVMlin and the invariant SVMs in noise-free experiments are shown. The rows show the different invariant classifiers. The columns provide the dimensionality of the underlying datasets n = {2, 10, 100}. The experiments are ordered by ascending distance of the class centroids d (x-axis). The y-axis provides the accuracy difference; a positive value denotes a higher accuracy of SVMlin. For each value of d, 10 experiments with different class centroids are shown.
Figure 4. Accuracies achieved under the influence of data transformations: the figure provides the results of noise experiments with invariant R2-SVMs and R1-SVMs on artificial datasets. (a) The effects of sample-wise data transformations on the test samples. (b) The influence of distinct class-wise data transformations for training and test samples. The results are organized in blocks (from left to right), which correspond to the types of applied data transformations. Each column provides the results of a subclass of invariant classifiers. The rows give the dimensionality of the data n = {2, 10, 100}. Each box contains the results of 10 repetitions r ∈ {1, …, 10} and six increasing noise parameters p ∈ {0, …, 5}. (Online version in colour.)
Figure 5. Results of 10 × 10 cross-validation experiments for transcriptome data: the mean accuracy is shown for the five concept classes of linear support vector machines (R2 and R1), for kNN with k ∈ {1, 3, 5}, for random forests with nt ∈ {100, 200, 300} trees and for stacked auto-encoders (SAE) with u ∈ {100, 500, 1000} units. Baseline denotes the performance of the classifier that always chooses the larger class. (Online version in colour.)
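The 10 × 10 cross-validation protocol and the baseline of Figure 5 can be sketched generically as follows. This is my own minimal reconstruction, not the authors' evaluation code; `fit_predict` stands in for any of the compared classifiers.

```python
import numpy as np

def cross_validate(X, y, fit_predict, n_runs=10, n_folds=10, seed=0):
    """10 x 10 cross-validation: 10 random fold assignments,
    each split into 10 folds; returns the mean test accuracy."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        folds = rng.permutation(len(y)) % n_folds   # equal-sized folds
        for f in range(n_folds):
            test = folds == f
            pred = fit_predict(X[~test], y[~test], X[test])
            accs.append(float(np.mean(pred == y[test])))
    return float(np.mean(accs))

def baseline(X_tr, y_tr, X_te):
    """The figure's baseline: always predict the larger training class."""
    return np.full(len(X_te), np.bincount(y_tr).argmax())
```

On a dataset with a 70/30 class split, the baseline's cross-validated accuracy equals the larger class fraction, which is why it serves as the floor that the SVMs, kNN, random forests and SAEs are compared against.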