Yinsheng Zhang, Qian Shang, Guoming Zhang.
Abstract
High-dimensional data are pervasive in the big-data era. To mitigate the curse of dimensionality, various dimensionality reduction (DR) algorithms have been proposed. To facilitate systematic comparison and assessment of DR quality, this paper reviews related metrics and develops an open-source Python package, pyDRMetrics. Supported metrics include reconstruction error, distance matrix, residual variance, ranking matrix, co-ranking matrix, trustworthiness, continuity, co-k-nearest neighbor size, LCMC (local continuity meta criterion), and rank-based local/global properties. pyDRMetrics provides a native Python class and a web-oriented API. A case study on mass spectra demonstrates the package functions. A web GUI wrapper is also published to support user-friendly B/S (browser/server) applications.
Keywords: Co-k-nearest neighbor; Co-ranking matrix; Dimensionality reduction; Distance matrix; Reconstruction error
Year: 2021 PMID: 33644472 PMCID: PMC7887408 DOI: 10.1016/j.heliyon.2021.e06199
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. Taxonomy of dimensionality reduction algorithms. Image adapted from Van Der Maaten [1].
Symbols and terms.
| Symbol/Term | Explanation |
|---|---|
| DR | Dimensionality Reduction |
| X | The data before DR. An m-by-n matrix. |
| | The |
| Z | Data after DR. An m-by-k matrix. |
| Xr | Reconstructed data from Z. |
| m | The number of samples/data points. |
| n | Dimension number before DR. The dimensionality of the original space. Column number of X. |
| k | In the context of DR, k is the dimensionality after DR, i.e., the column number of Z. |
| k | In the context of the rank-based metrics, such as QNN, k is the neighborhood size. |
| MSE | Mean Squared Error |
| rMSE | relative Mean Squared Error |
| $\lVert \cdot \rVert_F$ | The Frobenius norm of a matrix. |
| $\lvert \cdot \rvert$ | The cardinality of a set. The number of elements in a set. |
| | An |
| D | The distance matrix. An m-by-m matrix. |
| r | Pearson correlation coefficient |
| Vr | Residual variance |
| R | The ranking matrix. An m-by-m matrix. |
| Q | The co-ranking matrix. An |
| T | Trustworthiness |
| C | Continuity |
| QNN | Co-k-nearest neighbor size |
| AUC | Area Under Curve |
| LCMC | Local Continuity Meta Criterion |
| kmax | The maximum cutoff point of LCMC |
| Qlocal | Local property metric |
| Qglobal | Global property metric |
Figure 2. (a) An illustration of the co-ranking matrix. Intrusions lie in the lower triangle, which means DR pulls far-away points closer. Extrusions lie in the upper triangle, which means DR pushes near points apart. The bottom-right area is a trivial region of large ranks, which matter much less than local relations. (b) An ideal co-ranking matrix is diagonal: all diagonal elements are m, and all off-diagonal elements are zero.
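The structure described in the caption can be reproduced with a small NumPy sketch. It is not the package's implementation: the helper names `ranking_matrix` and `coranking_matrix`, the Euclidean distance, the (m-1)-by-(m-1) shape of Q, and the orientation (rows index ranks before DR, columns index ranks after DR) are all assumptions made for illustration, following the usual convention in the co-ranking literature.

```python
import numpy as np

def ranking_matrix(data):
    """Entry (i, j) is the rank of sample j among the neighbors of sample i.

    Assumes no duplicate points, so each sample ranks itself 0.
    """
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # m-by-m Euclidean distances
    # Applying argsort twice turns each row of distances into ranks.
    return dist.argsort(axis=1, kind="stable").argsort(axis=1, kind="stable")

def coranking_matrix(R, Rz):
    """Q[a-1, b-1] counts pairs (i, j) with rank a before DR and rank b after DR."""
    m = R.shape[0]
    Q = np.zeros((m - 1, m - 1), dtype=int)
    off_diag = ~np.eye(m, dtype=bool)               # skip self-pairs (rank 0)
    np.add.at(Q, (R[off_diag] - 1, Rz[off_diag] - 1), 1)
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                        # toy data, m = 20, n = 5
R = ranking_matrix(X)

# Ideal case: ranks are unchanged, so Q is diagonal with m on the diagonal.
Q_ideal = coranking_matrix(R, R)

# A lossy "DR" for comparison: keep only the first two columns.
Z = X[:, :2]
Q = coranking_matrix(R, ranking_matrix(Z))
intrusions = np.tril(Q, -1).sum()                   # rank decreased after DR
extrusions = np.triu(Q, 1).sum()                    # rank increased after DR
print(intrusions, extrusions)
```

With this orientation, a pair whose rank drops after DR lands below the diagonal, which matches the caption's placement of intrusions in the lower triangle.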
DR Quality Metrics and their Explanations.
| Metric | Math Equation | Explanation | Comment |
|---|---|---|---|
| Reconstruction Error | | Reconstruction error, measured by the MSE between X and Xr. | Requires the reconstructed data Xr. |
| Relative Reconstruction Error | | Relative reconstruction error, measured by the relative MSE between X and Xr. | |
| Distance matrix | | Measures the pair-wise distance property. The distance matrices before and after DR should be similar. | The distance can be Euclidean, or the RBF (radial basis function) similarity. |
| Residual variance | | Residual variance of the distance matrices before and after DR. | |
| Ranking matrix | | Contains the ranking information. The ranking matrices before and after DR should be similar. | |
| Co-ranking matrix | | Measures how sample ranks change after the DR. | |
| Trustworthiness | | Measures error of hard intrusions. | Ranges from 0 to 1. |
| Continuity | | Measures error of hard extrusions. | Ranges from 0 to 1. |
| Co-k-nearest neighbor size | | Counts how many points are in both the k-nearest neighborhoods before and after DR. | Divide by ( |
| The area under the curve | | The area under the QNN curve. | Ranges from 0.5 to 1. |
| Local Continuity Meta Criterion | | LCMC is | LCMC favors locality more than |
| Local property metric | | Measures local property (small k). | |
| Global property metric | | Measures global property (big k). | |
These items are single numeric metrics. Others are arrays or matrices.
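As a rough illustration of the single-valued, non-rank-based metrics in the table, the sketch below computes a reconstruction error, a relative reconstruction error, and a residual variance with NumPy. The exact normalizations are assumptions, since the equations are not reproduced here: MSE is taken as the mean squared entry of X − Xr, relative MSE as the ratio of squared Frobenius norms, and residual variance as 1 − r² with Pearson's r computed over the flattened distance matrices. The function names are hypothetical, and PCA via SVD stands in for any DR algorithm with an inverse transform.

```python
import numpy as np

def pairwise_distances(data):
    """m-by-m Euclidean distance matrix."""
    diff = data[:, None, :] - data[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def reconstruction_errors(X, Xr):
    """Return (MSE, relative MSE) between X and its reconstruction Xr."""
    mse = np.mean((X - Xr) ** 2)
    rmse = np.linalg.norm(X - Xr, "fro") ** 2 / np.linalg.norm(X, "fro") ** 2
    return mse, rmse

def residual_variance(D, Dz):
    """1 - r^2, where r is Pearson's r between the two distance matrices."""
    r = np.corrcoef(D.ravel(), Dz.ravel())[0, 1]
    return 1.0 - r ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                 # toy data, m = 30, n = 6

# PCA with k = 2 via SVD, including the inverse transform.
k = 2
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:k].T                    # data after DR, m-by-k
Xr = Z @ Vt[:k] + mean                       # reconstructed data, m-by-n

D, Dz = pairwise_distances(X), pairwise_distances(Z)
mse, rmse = reconstruction_errors(X, Xr)
Vr = residual_variance(D, Dz)
print(mse, rmse, Vr)
```

Note that `rmse` here follows the document's naming for the relative MSE and should not be read as a root mean squared error.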
API definition.
| Class DRMetrics(builtins.object) | Define a set of dimensionality reduction metrics. |
|---|---|
| Properties / Fields | |
| X | Data before DR. m-by-n matrix. m is the sample size. n is the dimension/feature number. |
| Z | Data after DR. m-by-k matrix. Typically, k << n |
| Xr | Reconstructed data. m-by-n matrix. Optional parameter. If a DR algorithm has no inverse transform, pass None. |
| D | Distance matrix of X |
| Dz | Distance matrix of Z |
| Vr | The residual variance between D and Dz. The default version uses Pearson’s r to calculate the residual variance. Use Vrs for the Spearman’s r version. |
| mse | Reconstruction error. MSE of X and Xr |
| rmse | Relative reconstruction error. Relative MSE of X and Xr |
| R | Ranking matrix of X |
| Rz | Ranking matrix of Z |
| Q | Co-ranking matrix between R and Rz |
| T | Trustworthiness. An array. There is also a single-valued AUC_T that measures the area under the T (trustworthiness) curve. |
| C | Continuity. An array. There is also a single-valued AUC_C that measures the area under the C (continuity) curve. |
| QNN | Co-k-nearest neighbor size. An array. |
| AUC | The area under the QNN curve. |
| LCMC | Local Continuity Meta Criterion. An array. |
| Qlocal | Local property metric |
| Qglobal | Global property metric |
| Member Methods | |
| __init__(self, X, Z, Xr=None) | Constructor. X is the original data. Z is the data after DR. Xr is the reconstructed data. |
| test(cls, csv, k = 3, dr = 'PCA') | A constructor overload that facilitates testing common DR algorithms. csv is the data file. dr is used to specify the algorithm, e.g. “PCA”, “NMF”, “RP”, “TSNE”, etc. |
| plot_coranking_matrix(self) | Visualize the co-ranking matrix between R and Rz as heatmaps. |
| plot_distance_matrix(self) | Visualize the distance matrices before and after DR as heatmaps, i.e., D and Dz. |
| plot_ranking_matrix(self) | Visualize the ranking matrices before and after DR as heatmaps, i.e., R and Rz. |
| visualize_reconstruction(self) | Plot the original data and the reconstructed data side by side. Show 3 random samples. |
| report(self) | Print out a summary report, with inline plots. |
| get_json(self) | Generate a JSON-format dictionary object containing all DR quality metrics. Only numeric metrics are returned. |
| get_html(self) | Generate an HTML segment that can be embedded in web pages. Plotted images are embedded as base64 strings to avoid referencing external files. |
| Sample Code | |
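A minimal usage sketch of the class described in the table above follows. The constructor arguments and property names are taken from the table; the import path `pyDRMetrics.pyDRMetrics` is an assumption and should be checked against the package documentation. The toy data and the SVD-based PCA are placeholders for a real dataset and DR algorithm, and the import is guarded so the script still runs when the package is not installed.

```python
import numpy as np

# Toy data standing in for a real dataset: m = 50 samples, n = 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

# PCA with k = 3 via SVD, keeping the inverse transform so Xr is available.
k = 3
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:k].T          # data after DR, m-by-k
Xr = Z @ Vt[:k] + mean             # reconstructed data, m-by-n

try:
    # Assumed import path; verify against the installed package.
    from pyDRMetrics.pyDRMetrics import DRMetrics
except ImportError:
    DRMetrics = None

if DRMetrics is not None:
    drm = DRMetrics(X, Z, Xr)      # pass Xr=None if there is no inverse transform
    print(drm.mse, drm.rmse)       # reconstruction errors
    print(drm.Vr, drm.AUC)         # residual variance, area under the QNN curve
    print(drm.Qlocal, drm.Qglobal) # local and global property metrics
```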
Figure 3. Scheme of TOF MS (time-of-flight mass spectrometry). m/z = mass-to-charge ratio. t = drift time. The detector records the arrival times of charged particles (e.g., ions), from which the m/z can be calculated. The signal intensity received by the detector at a specific drift time corresponds to the abundance of a specific particle.
Figure 4. The averaged waveform of the SELDI-TOF-MS dataset.
Figure 5. The explained variances of different numbers of components.
DR Quality Metrics of PCA (k = 5) returned by pyDRMetrics
| Metric | DRMetrics | Returned value |
|---|---|---|
| Reconstruction Error | obj.mse | 0.00683 |
| Relative Reconstruction Error | obj.rmse | 0.0355 |
| Show random samples before and after DR, side by side | obj.visualize_reconstruction() | |
| Distance matrices before and after DR | obj.D, obj.Dz, obj.plot_distance_matrix() | |
| Residual variance | obj.Vr, obj.Vrs | 0.029, 0.028 |
| Ranking matrices before and after DR | obj.R, obj.Rz, obj.plot_ranking_matrix() | |
| Co-ranking matrix | obj.Q, obj.plot_coranking_matrix() | |
| Trustworthiness, Continuity | obj.T, obj.C | |
| | obj.AUC_T, obj.AUC_C | 0.996, 0.998 |
| Co-k-nearest neighbor size | obj.QNN | |
| The area under the curve of QNN | obj.AUC | 0.932 |
| Local Continuity Meta Criterion | obj.LCMC | |
| Maximum cutoff of LCMC | obj.kmax | 23 |
| Local property | obj.Qlocal | 0.717 |
| Global property | obj.Qglobal | 0.954 |
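The rank-based rows of this table (QNN, LCMC, kmax, Qlocal, Qglobal) are all derived from the co-ranking matrix. The sketch below shows one common way to do that, following the conventions usually found in the co-ranking literature; the package's exact formulas may differ. Assumptions: Q is (m-1)-by-(m-1), QNN(K) is the sum of the upper-left K-by-K block of Q divided by K·m, LCMC(K) = QNN(K) − K/(m−1), kmax maximizes LCMC, and Qlocal and Qglobal average QNN up to and from kmax. The function name is hypothetical, and the two test matrices are illustrative inputs, not data from the case study.

```python
import numpy as np

def rank_based_metrics(Q):
    """Derive QNN, LCMC, kmax, Qlocal and Qglobal from a co-ranking matrix Q."""
    m = Q.shape[0] + 1                        # assumes Q is (m-1)-by-(m-1)
    K = np.arange(1, m)                       # neighborhood sizes 1..m-1
    # Sum of the upper-left K-by-K block of Q, for every K.
    block_sums = np.array([Q[:k, :k].sum() for k in K])
    qnn = block_sums / (K * m)
    lcmc = qnn - K / (m - 1)                  # subtract the random baseline
    kmax = int(K[np.argmax(lcmc)])
    q_local = qnn[:kmax].mean()               # average over K = 1..kmax
    q_global = qnn[kmax - 1:].mean()          # average over K = kmax..m-1
    return qnn, lcmc, kmax, q_local, q_global

m = 10
# Perfect rank preservation: diagonal Q with m on the diagonal.
Q_perfect = m * np.eye(m - 1)
# Expected Q of a random embedding: every cell holds m / (m - 1).
Q_random = np.full((m - 1, m - 1), m / (m - 1))

qnn_p, lcmc_p, kmax_p, ql_p, qg_p = rank_based_metrics(Q_perfect)
qnn_r, lcmc_r, kmax_r, ql_r, qg_r = rank_based_metrics(Q_random)
print(qnn_p, kmax_p, ql_p, qg_p)
print(qnn_r, lcmc_r)
```

Under these assumptions, the perfect case gives QNN equal to 1 for every K, while the random-expectation case gives QNN(K) = K/(m−1) and an LCMC of zero, which is why LCMC can be read as QNN corrected for chance.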
Figure 6. The curves of numeric metrics against different k values. (1) Relative reconstruction error (rMSE) curve. (2) The curve of the residual variance between distance matrices. (3) The curve of QNN AUC (area under the curve). (4) Qlocal curve. (5) Qglobal curve.
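A curve like panel (1) can be produced by repeating the DR for each target dimensionality k and recording the metric. The sketch below does this for the relative reconstruction error only, using PCA via SVD on synthetic data; the data, the variable names, and the relative-MSE normalization (ratio of squared Frobenius norms) are assumptions for illustration and do not reproduce the mass-spectra case study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose features have decreasing variance.
X = rng.normal(size=(40, 8)) * np.linspace(3.0, 0.5, 8)

mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

rel_errors = []
for k in range(1, X.shape[1] + 1):
    Z = Xc @ Vt[:k].T                   # project onto the first k components
    Xr = Z @ Vt[:k] + mean              # map back to the original space
    rel_errors.append(np.sum((X - Xr) ** 2) / np.sum(X ** 2))

for k, err in enumerate(rel_errors, start=1):
    print(k, err)
```

For PCA the error can only shrink as k grows and vanishes when k equals n, so the curve is mainly useful for spotting the k beyond which extra components stop paying off.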
DR metrics for two special cases.
| Metric | | |
|---|---|---|
| Co-ranking matrix (Q) | | |
| Co-k-nearest neighbor size (QNN) | | |
| Local Continuity Meta Criterion (LCMC) | | |
Figure 7. The web-based GUI powered by pyDRMetrics. The GUI has a basic mode that supports testing built-in public algorithms and an extended mode for testing user-defined algorithms. The GUI is published at http://spacs.brahma.pub/research/DR.