Vittoria Bruni, Maria Lucia Cardinali, Domenico Vitulano.
Abstract
The minimum description length (MDL) is a powerful criterion for model selection that is attracting growing interest from both theorists and practitioners. It allows the best model for representing data to be selected automatically, without a priori information about the data. It relies only on the data and on model complexity, selecting, from a predefined set of models, the one that yields the shortest coding length. In this paper, we briefly review the basic ideas underlying the MDL criterion and its applications in different fields, with particular reference to the dimension reduction problem. As an example, the role of MDL in selecting the best principal components in the well-known principal component analysis (PCA) is investigated.
Keywords: classification; dimension reduction; features extraction; minimum description length; principal component analysis
Year: 2022 PMID: 35205563 PMCID: PMC8871178 DOI: 10.3390/e24020269
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
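The abstract describes MDL as choosing, among candidate models, the one with the shortest coding length, and applies it to picking the number of principal components. The sketch below illustrates that idea with the classical Wax-Kailath form of the MDL score on the eigenvalues of the sample covariance. This is a swapped-in, commonly quoted criterion, not the stochastic-complexity formula of the paper's Equation (5), which is not reproduced here; the function name `mdl_num_components` and the parameter count in the penalty term are assumptions made for illustration.

```python
import numpy as np


def mdl_num_components(X):
    """Estimate the number of principal components with a Wax-Kailath-style MDL score.

    X is an (n_samples, n_features) array. Returns (best_k, scores), where
    scores[k] is the description length when k components are kept.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the sample covariance, sorted in descending order.
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against log(0)

    scores = np.empty(p)
    for k in range(p):
        tail = eigvals[k:]  # eigenvalues treated as noise when k components are kept
        # log(geometric mean / arithmetic mean) of the discarded eigenvalues;
        # it is 0 when they are all equal and negative otherwise.
        log_ratio = np.mean(np.log(tail)) - np.log(np.mean(tail))
        data_cost = -n * (p - k) * log_ratio
        # Model cost: an assumed free-parameter count k * (2p - k).
        model_cost = 0.5 * k * (2 * p - k) * np.log(n)
        scores[k] = data_cost + model_cost
    return int(np.argmin(scores)), scores
```

The data term shrinks as more components are kept, while the model term grows with `k`; the selected value is where their sum is smallest, which mirrors the trade-off between data fit and model complexity described in the abstract.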
Figure 1. TEST 1. (Top) Red dotted line: versus the number of components k for , ; the minimum is correctly attained at . Blue dashed line: versus the number of components k for , ; the minimum is correctly attained at . (Bottom) The same plot where has been considered to improve its readability in correspondence to the minimum value.
Figure 2. TEST 2. Plot of and its components versus the number of components k. (Top) Non-normalized data. (Bottom) Normalized data.
Figure 3. TEST 2. (Top) Plot of and its components versus the number of components k. Signals have been uniformly sampled so that the dimension of is . (Bottom) Plot of .
Figure 4. TEST 3. Plot of and its components versus the number of components k; the minimum is attained at k = 22.
Figure 5. TEST 3. (Left) Ground-truth Indian Pines image; (Middle) classification image using the best result of PCA-SVM in [78]; (Right) classification image using the PCA-SVM method with the number of components estimated using the stochastic complexity, as in Equation (5).
Number of principal components selected for the three tests using different criteria: percentage of variance to be retained (90%, 95%, and 99%), Bartlett’s test with significance levels of 0.05 and 0.01, and the MDL criterion. The last column contains the expected number.
| Test | 90% variance | 95% variance | 99% variance | Bartlett’s Test (0.05) | Bartlett’s Test (0.01) | MDL | True Value |
|---|---|---|---|---|---|---|---|
| Test 1 | 1 | 1 | 1 | 30 | 30 | 5 | 5 |
| Test 2 | 37 | 52 | 75 | 90 | 90 | 90 | 3 |
| Test 2 (decimated data) | 2 | 6 | 24 | 63 | 63 | 2 | 3 |
| Test 3 | 2 | 6 | 27 | 161 | 159 | 22 | 22 |
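For comparison with the MDL column, the retained-variance criterion in the table can be sketched as follows: keep the smallest number of components whose cumulative share of the total variance reaches a chosen threshold. The function name `components_for_variance` and its default threshold are hypothetical, and the paper's exact preprocessing (for example, whether the data are normalized first) is not assumed here.

```python
import numpy as np


def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components retaining `threshold` of the variance.

    X is an (n_samples, n_features) array; threshold is a fraction in (0, 1].
    """
    # Eigenvalues of the sample covariance, sorted in descending order.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off values
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    # First index whose cumulative share is >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, eigvals.size)
```

Unlike MDL, this rule needs a user-chosen threshold, and the table shows how strongly the selected number can vary with it (for example, 2, 6, and 27 components for Test 3 at 90%, 95%, and 99%).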