Yan Zhong, Simon Fong, Shimin Hu, Raymond Wong, Weiwei Lin.
Abstract
The Internet of Things (IoT) and sensors are becoming increasingly popular, especially for monitoring large ambient environments. Applications that embrace IoT and sensors often need to mine data feeds collected at frequent intervals for intelligence. Although such sensor data are massive, most of their content is identical and repetitive; an example is human traffic in a park at night. Most traditional classification algorithms were formulated decades ago and were not designed to handle such sensor data effectively. Hence, the performance of the learned model is often poor because of the small granularity in classification and the sporadic patterns in the data. To improve the quality of data mining on IoT data, a new pre-processing methodology based on subspace similarity detection is proposed. The method integrates well with traditional data mining algorithms and anomaly detection methods, and it is flexible enough to handle the similar kinds of sporadic sensor data that arise in many ambient sensing applications. The proposed methodology is evaluated in extensive experiments with a collection of classical data mining models, and the results show an improvement in precision.
Keywords: Internet of Things; preprocessing; sensor data; subspace similarity
Year: 2019 PMID: 31635371 PMCID: PMC6832605 DOI: 10.3390/s19204536
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
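The abstract describes sensor feeds whose contents are largely identical and repetitive. As a rough illustration of that idea only, the sketch below splits a series into fixed-size windows and drops a window when it is nearly identical (by cosine similarity) to the last window kept. This is not the authors' method: the function names, the window size, the threshold, and the choice of cosine similarity are all assumptions made for the example.

```python
import math


def split_windows(series, size):
    """Split a sequence into consecutive, non-overlapping windows of `size` samples."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_repetitive_windows(series, size, threshold=0.999):
    """Keep a window only if it differs enough from the last kept window.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    kept = []
    for window in split_windows(series, size):
        if not kept or cosine_similarity(kept[-1], window) < threshold:
            kept.append(window)
    return kept


# Hypothetical readings: a flat stretch, a burst of activity, then flat again.
readings = [1, 1, 1, 1, 1, 1, 5, 9, 2, 1, 1, 1]
print(drop_repetitive_windows(readings, size=3))
```

Cosine similarity ignores magnitude, so two flat windows at different levels would count as identical here; a distance-based measure may suit data where the level matters.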
Figure 1. Kinds of problems in the Internet of Things.
Figure 2. Process of transferring the raw training dataset to the new training dataset.
Figure 3. Process of transferring the raw test dataset to the new test dataset.
Description of various datasets.
| Dataset | Instances | Features | Labels | Noisy |
|---|---|---|---|---|
| LD | 164860 | 8 | 8 | 10000 |
| HHAR | 43930257 | 16 | 6 | 100000 |
| AReM | 42240 | 6 | 4 | 1000 |
Figure 4. Visualization of the AReM dataset as time series of five feature values. All the series are stationary, without any significant or distinctive shape, and fluctuate within a limited range of values.
The performance of TBSS and other pre-processing methods on the localization recognition dataset.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| Localization Dataset ( | |||
| KNN | 0.1721 ± 0.018 | 0.3015 ± 0.011 | 0.2134 ± 0.019 |
| KNN+PCA | 0.1154 ± 0.016 | 0.2125 ± 0.013 | 0.1363 ± 0.014 |
| KNN+Normalize | 0.1422 ± 0.012 | 0.2614 ± 0.015 | 0.1832 ± 0.009 |
| KNN+IPCA | 0.0134 ± 0.017 | 0.1683 ± 0.029 | 0.1035 ± 0.027 |
| KNN+TBSS | 0.0467 ± 0.028 | 0.0454 ± 0.014 | 0.0335 ± 0.018 |
| DT | 0.1811 ± 0.015 | 0.2436 ± 0.025 | 0.2021 ± 0.019 |
| DT+PCA | 0.1043 ± 0.018 | 0.1653 ± 0.029 | 0.1137 ± 0.025 |
| DT+Normalize | 0.1457 ± 0.028 | 0.2134 ± 0.014 | 0.1612 ± 0.018 |
| DT+IPCA | 0.0854 ± 0.035 | 0.1210 ± 0.023 | 0.0814 ± 0.037 |
| DT+TBSS | 0.3264 ± 0.016 | 0.0978 ± 0.015 | 0.1248 ± 0.026 |
| SVM | 0.1321 ± 0.035 | 0.3014 ± 0.029 | 0.1731 ± 0.013 |
| SVM+PCA | 0.0934 ± 0.019 | 0.2426 ± 0.028 | 0.1367 ± 0.018 |
| SVM+Normalize | 0.0823 ± 0.031 | 0.2464 ± 0.012 | 0.1134 ± 0.019 |
| SVM+IPCA | 0.1012 ± 0.027 | 0.2810 ± 0.036 | 0.1521 ± 0.021 |
| SVM+TBSS | 0.1634 ± 0.022 | 0.0767 ± 0.010 | 0.0936 ± 0.019 |
| LG | 0.1311 ± 0.029 | 0.3023 ± 0.019 | 0.1763 ± 0.016 |
| LG+PCA | 0.0325 ± 0.026 | 0.0312 ± 0.028 | 0.0853 ± 0.019 |
| LG+Normalize | 0.0821 ± 0.009 | 0.2516 ± 0.008 | 0.1274 ± 0.007 |
| LG+IPCA | 0.1234 ± 0.020 | 0.2346 ± 0.010 | 0.1474 ± 0.012 |
| LG+TBSS | 0.2668 ± 0.025 | 0.1121 ± 0.019 | 0.1453 ± 0.016 |
| NB | 0.1333 ± 0.022 | 0.2745 ± 0.018 | 0.1624 ± 0.017 |
| NB+PCA | 0.0929 ± 0.014 | 0.2535 ± 0.018 | 0.1374 ± 0.019 |
| NB+Normalize | 0.1153 ± 0.013 | 0.2074 ± 0.018 | 0.1136 ± 0.018 |
| NB+IPCA | 0.1167 ± 0.012 | 0.2963 ± 0.011 | 0.1526 ± 0.019 |
| NB+TBSS | 0.2442 ± 0.011 | 0.0947 ± 0.013 | 0.1163 ± 0.029 |
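The results tables report precision, recall, and F1 score as a value ± a spread. The record does not state how these were averaged, so the sketch below makes two assumptions: scores are macro-averaged over classes, and the ± term is the sample standard deviation over repeated runs. The helper names `macro_scores` and `summarize` are hypothetical.

```python
from statistics import mean, stdev


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the labels in y_true."""
    precisions, recalls, f1s = [], [], []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against empty denominators (a class never predicted, etc.).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return mean(precisions), mean(recalls), mean(f1s)


def summarize(values):
    """Format repeated-run scores as 'mean ± std' (sample standard deviation)."""
    return f"{mean(values):.4f} ± {stdev(values):.3f}"


# Toy labels, not taken from any of the datasets above.
precision, recall, f1 = macro_scores([0, 0, 1, 1], [0, 1, 1, 1])
print(round(precision, 4), round(recall, 4), round(f1, 4))
print(summarize([0.1, 0.2, 0.3]))
```

Under macro averaging, the reported F1 is the mean of per-class F1 values, so it need not equal the harmonic mean of the reported precision and recall.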
The performance of TBSS and other pre-processing methods on the AReM recognition dataset.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| AReM Dataset ( | |||
| KNN | 0.1743 ± 0.016 | 0.1945 ± 0.014 | 0.1324 ± 0.012 |
| KNN+PCA | 0.3053 ± 0.016 | 0.2631 ± 0.018 | 0.2734 ± 0.016 |
| KNN+Normalize | 0.1784 ± 0.014 | 0.1982 ± 0.028 | 0.1335 ± 0.014 |
| KNN+IPCA | 0.3033 ± 0.101 | 0.2613 ± 0.018 | 0.2743 ± 0.026 |
| KNN+TBSS | 0.3042 ± 0.026 | 0.2634 ± 0.025 | 0.2733 ± 0.018 |
| DT | 0.1832 ± 0.010 | 0.2421 ± 0.015 | 0.2053 ± 0.015 |
| DT+PCA | 0.3453 ± 0.013 | 0.2442 ± 0.018 | 0.2042 ± 0.016 |
| DT+Normalize | 0.4935 ± 0.024 | 0.2674 ± 0.032 | 0.2434 ± 0.018 |
| DT+IPCA | 0.1845 ± 0.012 | 0.1534 ± 0.012 | 0.1057 ± 0.014 |
| DT+TBSS | 0.3037 ± 0.019 | 0.3353 ± 0.018 | 0.3073 ± 0.015 |
| SVM | 0.1335 ± 0.012 | 0.3073 ± 0.022 | 0.1753 ± 0.017 |
| SVM+PCA | 0.2675 ± 0.010 | 0.2443 ± 0.018 | 0.2123 ± 0.016 |
| SVM+Normalize | 0.3554 ± 0.017 | 0.3943 ± 0.012 | 0.2532 ± 0.012 |
| SVM+IPCA | 0.2642 ± 0.010 | 0.2421 ± 0.018 | 0.2122 ± 0.016 |
| SVM+TBSS | 0.2844 ± 0.013 | 0.2142 ± 0.013 | 0.2242 ± 0.009 |
| LG | 0.1334 ± 0.013 | 0.3034 ± 0.009 | 0.1732 ± 0.016 |
| LG+PCA | 0.2586 ± 0.010 | 0.2394 ± 0.028 | 0.1975 ± 0.018 |
| LG+Normalize | 0.2832 ± 0.020 | 0.2212 ± 0.042 | 0.2523 ± 0.032 |
| LG+IPCA | 0.2212 ± 0.010 | 0.1863 ± 0.038 | 0.1524 ± 0.036 |
| LG+TBSS | 0.3234 ± 0.029 | 0.2413 ± 0.014 | 0.2523 ± 0.021 |
| NB | 0.1313 ± 0.022 | 0.2726 ± 0.025 | 0.1613 ± 0.027 |
| NB+PCA | 0.3086 ± 0.010 | 0.2426 ± 0.018 | 0.2111 ± 0.016 |
| NB+Normalize | 0.3132 ± 0.023 | 0.1713 ± 0.009 | 0.0239 ± 0.010 |
| NB+IPCA | 0.2845 ± 0.015 | 0.2234 ± 0.027 | 0.2132 ± 0.023 |
| NB+TBSS | 0.3477 ± 0.020 | 0.2663 ± 0.017 | 0.1856 ± 0.012 |
The performance of TBSS and other pre-processing methods on the activity recognition dataset.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| Activity Recognition ( | |||
| KNN | 0.1734 ± 0.027 | 0.3023 ± 0.011 | 0.2113 ± 0.019 |
| KNN+PCA | 0.1543 ± 0.016 | 0.1434 ± 0.014 | 0.1665 ± 0.025 |
| KNN+Normalize | 0.1932 ± 0.028 | 0.2075 ± 0.024 | 0.2157 ± 0.023 |
| KNN+IPCA | 0.1374 ± 0.016 | 0.1834 ± 0.014 | 0.1656 ± 0.015 |
| KNN+TBSS | 0.0934 ± 0.015 | 0.1074 ± 0.013 | 0.0455 ± 0.013 |
| DT | 0.2664 ± 0.014 | 0.2767 ± 0.015 | 0.2568 ± 0.014 |
| DT+PCA | 0.2964 ± 0.016 | 0.2221 ± 0.018 | 0.2468 ± 0.025 |
| DT+Normalize | 0.2635 ± 0.013 | 0.2867 ± 0.013 | 0.2434 ± 0.024 |
| DT+IPCA | 0.3168 ± 0.024 | 0.2878 ± 0.023 | 0.2767 ± 0.027 |
| DT+TBSS | 0.3441 ± 0.027 | 0.2342 ± 0.039 | 0.3084 ± 0.024 |
| SVM | 0.1345 ± 0.022 | 0.3066 ± 0.012 | 0.1753 ± 0.037 |
| SVM+PCA | 0.3105 ± 0.026 | 0.2368 ± 0.046 | 0.2323 ± 0.013 |
| SVM+Normalize | 0.3243 ± 0.023 | 0.2163 ± 0.035 | 0.2435 ± 0.033 |
| SVM+IPCA | 0.2532 ± 0.016 | 0.2463 ± 0.046 | 0.2513 ± 0.014 |
| SVM+TBSS | 0.3021 ± 0.035 | 0.2147 ± 0.033 | 0.2373 ± 0.013 |
| LG | 0.1924 ± 0.013 | 0.2432 ± 0.010 | 0.1703 ± 0.010 |
| LG+PCA | 0.2523 ± 0.029 | 0.2154 ± 0.024 | 0.2163 ± 0.014 |
| LG+Normalize | 0.2935 ± 0.026 | 0.1964 ± 0.016 | 0.2243 ± 0.035 |
| LG+IPCA | 0.2626 ± 0.037 | 0.2172 ± 0.013 | 0.2025 ± 0.015 |
| LG+TBSS | 0.2834 ± 0.016 | 0.2053 ± 0.014 | 0.2153 ± 0.015 |
| NB | 0.2653 ± 0.024 | 0.2753 ± 0.025 | 0.3026 ± 0.014 |
| NB+PCA | 0.2923 ± 0.014 | 0.2845 ± 0.015 | 0.2624 ± 0.053 |
| NB+Normalize | 0.3129 ± 0.012 | 0.2965 ± 0.023 | 0.3024 ± 0.012 |
| NB+IPCA | 0.3123 ± 0.023 | 0.2754 ± 0.023 | 0.2636 ± 0.014 |
| NB+TBSS | 0.3534 ± 0.017 | 0.2623 ± 0.034 | 0.2753 ± 0.027 |
Performance for different subspace sizes.
| Localization Dataset | |||
|---|---|---|---|
| KNN | 0.0612 ± 0.012 | 0.0723 ± 0.011 | 0.0353 ± 0.012 |
| DT | 0.3134 ± 0.032 | 0.0713 ± 0.017 | 0.1462 ± 0.016 |
| SVM | 0.2052 ± 0.013 | 0.0746 ± 0.013 | 0.1134 ± 0.032 |
| LG | 0.2852 ± 0.038 | 0.1153 ± 0.013 | 0.1524 ± 0.013 |
| NB | 0.2626 ± 0.011 | 0.1163 ± 0.013 | 0.1323 ± 0.012 |
| KNN | 0.0253 ± 0.010 | 0.0623 ± 0.011 | 0.0352 ± 0.011 |
| DT | 0.2353 ± 0.022 | 0.1275 ± 0.041 | 0.1253 ± 0.023 |
| SVM | 0.3125 ± 0.012 | 0.1035 ± 0.012 | 0.1153 ± 0.012 |
| LG | 0.1834 ± 0.042 | 0.0734 ± 0.014 | 0.0835 ± 0.014 |
| NB | 0.1953 ± 0.031 | 0.0922 ± 0.021 | 0.1073 ± 0.011 |
| KNN | 0.0134 ± 0.007 | 0.0324 ± 0.008 | 0.0132 ± 0.005 |
| DT | 0.2035 ± 0.056 | 0.1652 ± 0.022 | 0.1764 ± 0.012 |
| SVM | 0.4212 ± 0.010 | 0.1662 ± 0.024 | 0.1823 ± 0.032 |
| LG | 0.1626 ± 0.033 | 0.1023 ± 0.012 | 0.1043 ± 0.042 |
| NB | 0.3017 ± 0.028 | 0.0845 ± 0.012 | 0.1115 ± 0.051 |
Classification performance for different numbers of iterations on the LG dataset.
| Parameters | Epoch | Precision | Recall | F1 Score |
|---|---|---|---|---|
| | 10 | 0.2952 ± 0.025 | 0.0942 ± 0.012 | 0.1343 ± 0.022 |
| | 20 | 0.3045 ± 0.032 | 0.1084 ± 0.032 | 0.1253 ± 0.022 |
| | 30 | 0.3264 ± 0.023 | 0.1311 ± 0.023 | 0.1442 ± 0.014 |
| | 40 | 0.3323 ± 0.022 | 0.1434 ± 0.013 | 0.1503 ± 0.035 |
| | 50 | 0.3353 ± 0.015 | 0.1334 ± 0.024 | 0.1452 ± 0.024 |
| | | | | |
| | 10 | 0.3035 ± 0.010 | 0.0823 ± 0.013 | 0.1224 ± 0.022 |
| | 20 | 0.3130 ± 0.012 | 0.1026 ± 0.012 | 0.1164 ± 0.009 |
| | 30 | 0.3364 ± 0.024 | 0.1282 ± 0.034 | 0.1033 ± 0.013 |
| | 40 | 0.3464 ± 0.013 | 0.1374 ± 0.024 | 0.1079 ± 0.031 |
| | 50 | 0.3457 ± 0.022 | 0.1275 ± 0.015 | 0.1135 ± 0.023 |
| | | | | |
| | 10 | 0.4223 ± 0.010 | 0.1452 ± 0.025 | 0.1653 ± 0.024 |
| | 20 | 0.4232 ± 0.030 | 0.1674 ± 0.024 | 0.1854 ± 0.022 |
| | 30 | 0.4423 ± 0.012 | 0.1534 ± 0.016 | 0.1734 ± 0.033 |
| | 40 | 0.4564 ± 0.020 | 0.1241 ± 0.015 | 0.1477 ± 0.032 |
| | 50 | 0.4542 ± 0.028 | 0.1537 ± 0.025 | 0.1652 ± 0.025 |
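To read a table like the one above programmatically, each "mean ± std" cell can be split into two numbers and the rows compared by their mean. The sketch below does this for the first group of rows, copied from the table; the helper names `parse_cell` and `best_epoch` are hypothetical, and ranking by the mean alone (ignoring the spread) is a simplifying assumption.

```python
def parse_cell(cell):
    """Split a 'mean ± std' table cell into a (mean, std) pair of floats."""
    mean_text, std_text = cell.split("±")
    return float(mean_text), float(std_text)


# Epoch -> (precision, recall, F1) cells from the first group of rows above.
rows = {
    10: ("0.2952 ± 0.025", "0.0942 ± 0.012", "0.1343 ± 0.022"),
    20: ("0.3045 ± 0.032", "0.1084 ± 0.032", "0.1253 ± 0.022"),
    30: ("0.3264 ± 0.023", "0.1311 ± 0.023", "0.1442 ± 0.014"),
    40: ("0.3323 ± 0.022", "0.1434 ± 0.013", "0.1503 ± 0.035"),
    50: ("0.3353 ± 0.015", "0.1334 ± 0.024", "0.1452 ± 0.024"),
}


def best_epoch(rows, column):
    """Return the epoch whose cell in `column` (0=P, 1=R, 2=F1) has the highest mean."""
    return max(rows, key=lambda epoch: parse_cell(rows[epoch][column])[0])


print("best precision at epoch", best_epoch(rows, 0))
print("best F1 at epoch", best_epoch(rows, 2))
```

For this group the highest mean precision and the highest mean F1 fall at different epochs, and the differences between neighbouring epochs are within the reported spreads, so the ranking should be read with caution.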