Jörn Lötsch, Sebastian Malkusch, Alfred Ultsch.
Abstract
MOTIVATION: The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method.
Year: 2021 PMID: 34352006 PMCID: PMC8341664 DOI: 10.1371/journal.pone.0255838
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Mean squared errors of PCA and autoencoder-based data reproduction of the remaining data from the sampled data subset.
Samples of 0.001% and 0.01% of the data (1% and 10% for the smaller Iris and miRNA data sets) were drawn either once using uniform sampling or 1,000 times using uniform sampling with different seeds. In the latter case, the sample that best matched the original distribution of the variables, as judged by statistical comparisons of probability density functions, was selected. The sampled data were projected using either PCA or a single-layer autoencoder, and the projection parameters were then used to predict the remaining data that had not been sampled from the original data set. The experiments were performed in 20 replicates starting with different and non-redundant seeds; the means and standard deviations of the mean squared errors of the data reproduction across these replicates are shown.
| Data set | Instances | Features | Percent sampled | PCA, 1 trial | PCA, 1,000 trials | Autoencoder, 1 trial | Autoencoder, 1,000 trials | Percent sampled | PCA, 1 trial | PCA, 1,000 trials | Autoencoder, 1 trial | Autoencoder, 1,000 trials | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 4 | 1 | 0.14180 ± 0.04654 | 0.10003 ± 0.00685 | 0.16102 ± 0.07157 | 0.10049 ± 0.00762 | 10 | 0.09951 ± 0.00901 | 0.09049 ± 0.00217 | 0.07306 ± 0.01271 | 0.06273 ± 0.00528 | |
| Artificial | 30,000 | 10 | 0.001 | 154.8562 ± 21.04911 | 132.7860 ± 14.16717 | 291.8664 ± 98.05335 | 238.0392 ± 36.28223 | 0.01 | 111.9968 ± 23.77421 | 96.0451 ± 12.64811 | 279.0497 ± 107.61263 | 228.9243 ± 43.90218 | |
| Single Cell | 23,377 | 4 | 0.001 | 2510.473 ± 1215.4342 | 1960.399 ± 767.2990 | 46284.66 ± 10471.239 | 36119.58 ± 11173.769 | 0.01 | 116.858 ± 522.6049 | 208.169 ± 650.8854 | 40136.55 ± 10603.902 | 28433.88 ± 6664.088 | |
| FACS | 111,686 | 6 | 0.001 | 1.08773 ± 0.34595 | 0.81541 ± 0.10037 | 1.22914 ± 0.65268 | 0.83760 ± 0.17717 | 0.01 | 0.35906 ± 0.13173 | 0.26569 ± 0.02517 | 0.26754 ± 0.07288 | 0.21012 ± 0.03627 | |
| miRNA | 94 | 184 | 1 | 1.16598 ± 0.22298 | 0.96829 ± 0.05236 | 3.16437 ± 0.55408 | 2.86462 ± 0.58554 | 10 | 0.82086 ± 0.11161 | 0.73401 ± 0.04832 | 2.83602 ± 0.30254 | 2.69106 ± 0.04607 |
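The reproduction error described in the caption above can be sketched as follows for the PCA case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pca_reconstruction_mse`, the choice of two retained components, and the synthetic stand-in data are all hypothetical, since the record does not state these details.

```python
import numpy as np

def pca_reconstruction_mse(sample, remaining, n_components=2):
    """Fit PCA on `sample`, then project and reconstruct `remaining`.

    n_components=2 is an assumption; the source does not state how many
    components were retained.
    """
    mean = sample.mean(axis=0)
    # Principal axes are the right singular vectors of the centered sample.
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    components = vt[:n_components]
    # Project the unsampled rows onto the axes learned from the sample,
    # then map them back into the original feature space.
    centered = remaining - mean
    reconstructed = centered @ components.T @ components + mean
    return float(np.mean((remaining - reconstructed) ** 2))

# Synthetic stand-in data (hypothetical): 1,000 rows, 4 correlated features.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))

# Split into a 10% sample and the remaining 90%.
idx = rng.permutation(len(data))
sample, remaining = data[idx[:100]], data[idx[100:]]

mse = pca_reconstruction_mse(sample, remaining, n_components=2)
print(f"MSE of reproducing the remaining data: {mse:.5f}")
```

A sample that represents the full data well should yield principal axes close to those of the full data set, and therefore a lower error on the rows that were not sampled.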
Comparison of various tests for differences between the distribution of the downsampled data and the distribution of the full data set.
For this comparison, the Iris data set was used because it is fast to process and widely used in method development. The values are the means and standard deviations of the mean squared errors (MSE) of the PCA-based reproduction of the remaining data from the downsampled data, obtained in 20 replicates starting with different seeds. The similarity measures are sorted in ascending order by the rank of the MSE obtained when the final sample is chosen from 10,000 random samples.
| Distance test | 1% sampled, 1 trial | 1% sampled, 10,000 trials | 10% sampled, 1 trial | 10% sampled, 10,000 trials |
|---|---|---|---|---|
| | 0.14180 ± 0.04654 | 0.09271 ± 0.00524 | 0.09951 ± 0.00901 | 0.08942 ± 0.00223 |
| | | 0.09608 ± 0.00384 | | 0.08903 ± 0.00274 |
| | | 0.09408 ± 0.00300 | | 0.09042 ± 0.00196 |
| | | 0.04215 ± 0.00717 | | 0.09590 ± 0.00989 |
| | | 0.11496 ± 0.01739 | | 0.09037 ± 0.00334 |
| | | 0.10991 ± 0.00956 | | 0.09245 ± 0.00322 |
| | | 0.11394 ± 0.01579 | | 0.09060 ± 0.00318 |
| | | 0.10814 ± 0.00735 | | 0.10232 ± 0.00991 |
| | | 0.12021 ± 0.01755 | | 0.09332 ± 0.00640 |
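The selection step that these tests feed into (draw many uniform samples, keep the one whose distribution is closest to the full data) can be sketched as below. The two-sample Kolmogorov-Smirnov statistic is used here only as an illustrative stand-in, because the names of the tests compared in the table are not preserved in this record. The function names, the worst-variable aggregation, and the omission of class-proportional stratification are likewise assumptions of this sketch.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def best_of_n_sample(data, fraction, n_trials, seed=0):
    """Draw `n_trials` uniform samples; keep the one closest to the full data."""
    n_rows = max(1, round(fraction * len(data)))
    best_idx, best_score = None, np.inf
    for trial in range(n_trials):
        rng = np.random.default_rng([seed, trial])  # distinct seed per trial
        idx = rng.choice(len(data), size=n_rows, replace=False)
        # Assumption: a sample is scored by its worst-matching variable.
        score = max(ks_statistic(data[idx, j], data[:, j])
                    for j in range(data.shape[1]))
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score

# Synthetic stand-in data (hypothetical): 2,000 rows, 3 features.
data = np.random.default_rng(1).normal(size=(2000, 3))
_, score_1 = best_of_n_sample(data, 0.05, n_trials=1)
idx, score_100 = best_of_n_sample(data, 0.05, n_trials=100)
print(f"KS distance, 1 trial: {score_1:.4f}; best of 100 trials: {score_100:.4f}")
```

Because the first trial uses the same seed in both calls, the best-of-100 distance can never exceed the single-trial distance in this sketch. Whether a smaller distance also lowers the downstream reconstruction error is the empirical question the table addresses.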
Results of analysis of variance (ANOVA) of the mean squared data reconstruction errors.
The ANOVA was performed with the factors "fraction" and "number of trials" on the mean squared errors of the reconstruction of the remaining data from the downsampled data. Depending on the size of the data sets, class-proportional uniformly distributed random samples of, e.g., 0.001, 0.01, 0.1, 1, 5, 10, 25, and 50% of the original data were drawn. The experiments were performed in 20 replicates, each starting with a different and non-redundant seed.
| Data set name | ANOVA factor | Degrees of freedom | F value | p-value |
|---|---|---|---|---|
| Iris | Fraction | 4 | 61.809 | < 2 · 10⁻¹⁶ |
| | Number of trials | 4 | 33.867 | < 2 · 10⁻¹⁶ |
| | Fraction * number of trials | 16 | 7.423 | 1.31 · 10⁻¹⁵ |
| | Residuals | 475 | | |
| Artificial | Fraction | 2 | 1653.368 | < 2 · 10⁻¹⁶ |
| | Number of trials | 3 | 10.247 | 2.35 · 10⁻⁶ |
| | Fraction * number of trials | 6 | 2.732 | 0.0139 |
| | Residuals | 228 | | |
| Single Cell | Fraction | 2 | 358.707 | < 2 · 10⁻¹⁶ |
| | Number of trials | 3 | 3.257 | 0.0224 |
| | Fraction * number of trials | 6 | 2.338 | 0.0328 |
| | Residuals | 228 | | |
| FACS | Fraction | 2 | 649.689 | < 2 · 10⁻¹⁶ |
| | Number of trials | 3 | 9.953 | 3.43 · 10⁻⁶ |
| | Fraction * number of trials | 6 | 4.309 | 0.000387 |
| | Residuals | 228 | | |
| miRNA | Fraction | 4 | 1307.21 | < 2 · 10⁻¹⁶ |
| | Number of trials | 4 | 21.81 | < 2 · 10⁻¹⁶ |
| | Fraction * number of trials | 16 | 4.02 | 3.28 · 10⁻⁷ |
| | Residuals | 475 | | |
A p-value reported as < 2 · 10⁻¹⁶ reflects a technical constraint: the threshold corresponds to the machine epsilon of double-precision floating-point numbers (about 2.2 · 10⁻¹⁶), below which R does not print p-values by default.
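For orientation, the reporting threshold can be compared with two double-precision constants. The sketch below uses Python's `sys.float_info`; the corresponding R values are `.Machine$double.eps` and `.Machine$double.xmin`. It is an illustration of the floating-point limits only, not part of the original analysis.

```python
import sys

eps = sys.float_info.epsilon    # gap between 1.0 and the next larger double
smallest = sys.float_info.min   # smallest positive normalized double

print(f"machine epsilon: {eps:.3e}")       # 2.220e-16
print(f"smallest normal: {smallest:.3e}")  # 2.225e-308

# Adding half of epsilon to 1.0 is lost to rounding; adding epsilon is not.
print(1.0 + eps / 2 == 1.0)  # True
print(1.0 + eps > 1.0)       # True
```

The threshold of about 2 · 10⁻¹⁶ therefore matches the machine epsilon, which is many orders of magnitude larger than the smallest positive number a double can represent.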