Luana Fragoso, Tuhin Paul, Flaviu Vadan, Kevin G Stanley, Scott Bell, Nathaniel D Osgood.
Abstract
Patterns of spatial behavior dictate how we use our infrastructure, encounter other people, or are exposed to services and opportunities. Understanding these patterns through the analysis of data commonly available through commodity smartphones has become an important arena for innovation in both academia and industry. The resulting datasets can quickly become massive, indicating the need for concise understanding of the scope of the data collected. Some data is obviously correlated (for example GPS location and which WiFi routers are seen). Codifying the extent of these correlations could identify potential new models, provide guidance on the amount of data to collect, and even provide actionable features. However, identifying correlations, or even the extent of correlation, is difficult because the form of the correlation must be specified. Fractal-based intrinsic dimensionality directly calculates the minimum number of dimensions required to represent a dataset. We provide an intrinsic dimensionality analysis of four smartphone datasets over seven input dimensions, and empirically demonstrate an intrinsic dimension of approximately two.
Year: 2019 PMID: 31247031 PMCID: PMC6597084 DOI: 10.1371/journal.pone.0218966
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Datasets information.
| | SHED7 | SHED8 | SHED9 | SHED10 |
|---|---|---|---|---|
| Duration (days) | 38 | 30 | 40 | 29 |
| Participants | 63 | 75 | 87 | 108 |
| Records | 496M | 450M | 2.8B | 600M |
* M = million, B = billion
Dataset information before and after data preprocessing.
| Dataset | Measure | Before | After |
|---|---|---|---|
| SHED7 | Participants | 63 | 63 |
| SHED7 | Records | 496M | 170,754 |
| SHED8 | Participants | 75 | 60 |
| SHED8 | Records | 450M | 109,986 |
| SHED9 | Participants | 87 | 86 |
| SHED9 | Records | 2.8B | 231,382 |
| SHED10 | Participants | 108 | 105 |
| SHED10 | Records | 600M | 135,814 |
* M = million, B = billion
Fig 1. n-D Tree data insertion algorithm.
Performance report.
| | SHED7 | SHED8 | SHED9 | SHED10 |
|---|---|---|---|---|
| Runtime (min) | 13 | 10 | 25 | 12 |
| Memory usage (GB) | 7 | 5 | 10 | 6 |
Fig 2. Quad-Tree of SHED10.
Latitude and longitude partitioned in the first four levels.
Fig 3. Proportion of nodes with data per level.
Fig 4. TreeMap of nodes with data per level.
Fig 5. Total nodes per level.
Fig 6. Slope calculation of the ID of each dataset.
Black points represent the linear portion employed in slope calculation.
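The slope-based estimate can be illustrated with a generic box-counting sketch. This is not the authors' n-D tree implementation: it assumes the points are rescaled to the unit hypercube, counts the occupied grid cells at each level (cell side 2^-level), and takes the slope of log2(occupied cells) against level over a chosen range of levels as the dimension estimate. The function names and the choice of levels are hypothetical.

```python
import numpy as np

def occupied_cells_per_level(points, max_level):
    """Count occupied grid cells for levels 1..max_level (cell side 2**-level)."""
    points = np.asarray(points, dtype=float)
    # Rescale every dimension to [0, 1]; constant dimensions map to 0.
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    spans[spans == 0] = 1.0
    unit = (points - mins) / spans
    counts = []
    for level in range(1, max_level + 1):
        cells_per_axis = 2 ** level
        # Integer cell index per dimension; the value 1.0 is clipped into the last cell.
        idx = np.minimum((unit * cells_per_axis).astype(np.int64), cells_per_axis - 1)
        counts.append(len(np.unique(idx, axis=0)))
    return np.array(counts)

def box_counting_dimension(points, levels):
    """Slope of log2(occupied cells) versus level over the given levels."""
    levels = np.asarray(levels)
    counts = occupied_cells_per_level(points, int(levels.max()))
    log_counts = np.log2(counts[levels - 1])
    slope, _intercept = np.polyfit(levels, log_counts, 1)
    return slope
```

In practice the levels passed to `box_counting_dimension` should cover only the roughly linear part of the curve, since the count saturates once most occupied cells hold a single point.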
Ratio of leaves and IDs.
| Datasets | 8th level | 9th level | IDs |
|---|---|---|---|
| SHED7 | 54.43% | 69.25% | 1.86 |
| SHED8 | 58.14% | 74.72% | 1.82 |
| SHED9 | 32.13% | 45.61% | 1.90 |
| SHED10 | 45% | 64.69% | 1.85 |
Fig 7. Principal component analysis. PCA result of the cumulative explained variance ratio.
This figure shows the cumulative share of variance explained as the number of principal components grows from 1 to the maximum (the number of dimensions of the dataset).
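A cumulative explained variance ratio of this kind can be computed as in the minimal sketch below. It assumes a records-by-dimensions array and uses a plain SVD of the mean-centered data; whether and how the data were scaled before PCA is not stated here, so no scaling is applied, and the function name is hypothetical.

```python
import numpy as np

def cumulative_explained_variance(data):
    """Cumulative fraction of total variance captured by the first k principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional to
    # the variance along each principal component, in decreasing order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()
```

The k-th entry of the result is the value such a curve would show at k components; the last entry is always 1.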
Fig 8. Correlation Matrix. Correlation analyses between all the 7 dimensions.
Strong correlations have values close to 1 or -1.
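As an illustration of a pairwise correlation analysis, the sketch below computes a correlation matrix with NumPy. It assumes Pearson correlation and a records-by-dimensions array; the correlation measure actually used for the figure is not stated here, and the function name is hypothetical.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation between every pair of columns (dimensions)."""
    # rowvar=False treats each column as one variable and each row as one record.
    return np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
```

For a dataset with 7 dimensions the result is a symmetric 7 x 7 matrix with ones on the diagonal.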
Eigenvectors and eigenvalues for the first three principal components for each dataset.
PC1

| Dimension | SHED7 | SHED8 | SHED9 | SHED10 |
|---|---|---|---|---|
| hour | | | | |
| lat | -4.697e-04 | -8.496e-03 | -1.360e-03 | 1.464e-03 |
| lon | 1.354e-02 | 8.662e-03 | -7.657e-03 | -4.878e-03 |
| wifi | -1.811e-02 | -2.591e-02 | -2.807e-02 | -1.991e-02 |
| bat | | | | |
| acc | -3.192e-03 | 6.417e-04 | -6.354e-04 | -8.427e-04 |
| stddev | 1.519e-03 | 2.109e-04 | -4.522e-04 | -2.834e-04 |
| eigenvalue | 0.097 | 0.094 | 0.091 | 0.090 |

PC2

| Dimension | SHED7 | SHED8 | SHED9 | SHED10 |
|---|---|---|---|---|
| hour | -2.798e-02 | 1.599e-02 | 1.095e-03 | |
| lat | 2.723e-02 | | | |
| lon | 2.477e-02 | | | |
| wifi | -2.626e-02 | 9.556e-02 | -2.653e-02 | 1.038e-02 |
| bat | -1.606e-01 | 8.329e-02 | -4.470e-02 | |
| acc | 2.118e-03 | -2.163e-04 | 7.845e-04 | 2.004e-03 |
| stddev | -1.507e-03 | 4.703e-03 | -1.231e-04 | -2.287e-05 |
| eigenvalue | 0.064 | 0.019 | 0.024 | 0.022 |

PC3

| Dimension | SHED7 | SHED8 | SHED9 | SHED10 |
|---|---|---|---|---|
| hour | -5.396e-03 | 9.333e-02 | 9.539e-02 | |
| lat | -2.996e-02 | -1.354e-01 | -1.321e-02 | |
| lon | 5.459e-02 | -4.664e-02 | | |
| wifi | -1.213e-01 | -1.387e-01 | -7.667e-02 | |
| bat | 2.958e-02 | | | |
| acc | 8.667e-03 | -5.480e-03 | -4.923e-04 | 7.670e-04 |
| stddev | 3.851e-03 | 1.151e-02 | 3.386e-06 | 8.323e-05 |
| eigenvalue | 0.020 | 0.014 | 0.014 | 0.016 |
The most significant dimension is bold, the second is italic.