László Barna Iantovics, Călin Enăchescu.
Abstract
Sometimes it is difficult, or even impossible, to acquire the real sensor and machine data that research requires. Modern industrial platforms, for example, are frequently reluctant to share data. In such situations, the only option is to work with synthetic data obtained by simulation. A limitation of simulated data is that they may be unsuitable for research because of poor quality or limited quantity; algorithms designed and tested on such data do not give credible results. To avoid such situations, we consider that mathematically grounded data-quality assessments should be designed according to the specific type of problem to be solved. In this paper, we address a multivariate prediction problem whose results can ultimately be used for binary classification. We propose a mathematically grounded data-quality assessment that includes, among other things, an analysis of the predictive power of the independent variables used for prediction. We present the assumptions that the synthetic data should satisfy. The threshold values are established by a human assessor. If the research data pass all the assumptions, they can be considered appropriate for research and can also be used with other methods for solving the same type of problem. The applied method ultimately delivers a classification table to which any indicator of classification quality can be applied, such as sensitivity, specificity, accuracy, F1 score, area under the curve (AUC), receiver operating characteristic (ROC), true skill statistic (TSS) and the Kappa coefficient. The values of these indicators allow the results of the considered method to be compared with those of any other method applied to the same type of problem.
For evaluation and validation purposes, we performed an experimental case study on a novel synthetic dataset provided by the well-known UCI data repository.
Keywords: Industry 4.0; classification problem; data-quality assessment; explainable artificial intelligence; machine learning; model fit; prediction problem; predictive power; sensor data; smart factory; smart sensor; statistical modeling
Year: 2022 PMID: 35214509 PMCID: PMC8876977 DOI: 10.3390/s22041608
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Variables in the equation considering the intercept-only model.
| | B | S.E. | Wald | p-value | Exp(B) |
|---|---|---|---|---|---|
| Constant | | | | | |
Prediction power classification established by the human assessor (HA).
| Name of the Class | Interval |
|---|---|
| “unsatisfactory” prediction power | <10% |
| “weak” prediction power | (10%, 20%) |
| “appropriate” prediction power | (20%, 30%) |
| “good” prediction power | (30%, 40%) |
| “very good” prediction power | ≥40% |
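The class boundaries above can be expressed as a small lookup. This is only a sketch: the function name `prediction_power_class` is hypothetical, and it assumes that each interval includes its lower bound and excludes its upper bound, which the table does not state.

```python
def prediction_power_class(score_percent: float) -> str:
    """Map a prediction-power score, given in percent, to the assessor's class name."""
    # Assumption: intervals are half-open, e.g. 10% <= score < 20% is "weak".
    if score_percent < 10:
        return "unsatisfactory"
    if score_percent < 20:
        return "weak"
    if score_percent < 30:
        return "appropriate"
    if score_percent < 40:
        return "good"
    return "very good"


print(prediction_power_class(38.5))  # good
```

Because the thresholds are chosen by the assessor, keeping them in one function makes it easy to adjust them for a different study.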
Structure of classification table for machine-failure prediction.
| | Predicted Machine Failure: NO | Predicted Machine Failure: YES | Percentage Correct |
|---|---|---|---|
Variables in the equation with predictor variables included.
| | B | S.E. | Wald | p-value | Exp(B) | 95% CI of Exp(B), lower | 95% CI of Exp(B), upper |
|---|---|---|---|---|---|---|---|
| V2 | 0.772 | 0.072 | 114.079 | ~0 | 2.165 | 1.879 | 2.495 |
| V3 | −0.743 | 0.096 | 59.442 | ~0 | 0.476 | 0.394 | 0.575 |
| V4 | 0.012 | 0.001 | 473.247 | ~0 | 1.012 | 1.011 | 1.013 |
| Constant | −36.69 | 14.641 | 6.278 | 0.012 | 0 | | |
Snapshot of the data used in the research.
| UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|---|
| 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| … | … | … | … | … | … | … | … | … |
| 78 | L47257 | L | 298.8 | 308.9 | 1455 | 41.3 | 208 | 1 |
| … | … | … | … | … | … | … | … | … |
Results of the data normality verification.
| | V2 | V3 | V4 | V5 | V6 |
|---|---|---|---|---|---|
| Statistic | 0.067 | 0.49 | 0.104 | 0.009 | 0.06 |
| p-value | 0 | 0 | 0 | 0.64 | 0 |
| QQ plot | Figure 1 | Figure 2 | Figure 3 | Figure 4 | Figure 5 |
| Normality assumption | No | No | No | Yes | No |
Figure 1. QQ plot for V2.
Figure 2. QQ plot for V3.
Figure 3. QQ plot for V4.
Figure 4. QQ plot for V5.
Figure 5. QQ plot for V6.
Variables in the equation considering the intercept-only model.
| | B | S.E. | Wald | p-value | Exp(B) |
|---|---|---|---|---|---|
| Constant | −3.35 | 0.055 | 3675.133 | ~0 | 0.035 |
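In an intercept-only logistic model, the constant is the log-odds of the positive class, so it can be checked against the class counts. The sketch below assumes 9661 non-failure and 339 failure cases, taken from the row totals of the classification table with cut value 0.5 reported later in this record.

```python
import math

# Assumed class counts (row totals of the cut-value-0.5 classification table).
n_no_failure = 9635 + 26   # 9661
n_failure = 271 + 68       # 339

odds = n_failure / n_no_failure   # odds of failure, i.e. the exponentiated constant
b0 = math.log(odds)               # constant of the intercept-only model

print(round(b0, 2), round(odds, 3))  # -3.35 0.035
```

Both values agree with the first and last numeric columns of the Constant row above.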
Result of the Omnibus Test of Model Coefficients.
| | Chi-square | p-value |
|---|---|---|
| Step | 1038.064 | ~0 |
| Block | 1038.064 | ~0 |
| Model | 1038.064 | ~0 |
Result of the Hosmer–Lemeshow test.
| Chi-square | p-value |
|---|---|
| 13.39 | 0.099 |
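The second value can be reproduced from the first if the statistic is read as a chi-square with 8 degrees of freedom. That degree-of-freedom value is an assumption (the usual choice of 10 groups minus 2); it is not stated in this record. The helper `chi2_sf_even_df` is a hypothetical name and uses the closed-form upper tail that holds for even degrees of freedom only.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1, building each term from the last.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(13.39, df=8)  # df=8 is an assumption
print(round(p_value, 3))  # 0.099
```

A p-value above the chosen significance threshold means the test does not reject the hypothesis that the model fits the data.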
Obtained pseudo-R² values.
| −2 Log-Likelihood | Cox and Snell R² | Nagelkerke R² |
|---|---|---|
| 1922.894 | 0.099 | 0.385 |
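The two pseudo-R² values follow from the −2 log-likelihood above and the model chi-square of 1038.064 from the omnibus test, using the standard Cox and Snell and Nagelkerke definitions. The sketch assumes a sample size of 10,000, which is the sum of the cells in the classification tables.

```python
import math

n = 10_000                 # assumed sample size (total of the classification table)
neg2ll_model = 1922.894    # -2 log-likelihood of the fitted model
chi2_model = 1038.064      # omnibus chi-square: drop in -2LL vs. intercept-only
neg2ll_null = neg2ll_model + chi2_model   # -2LL of the intercept-only model

# Cox and Snell: 1 - (L_null / L_model)^(2/n)
cox_snell = 1 - math.exp(-chi2_model / n)
# Nagelkerke: Cox and Snell rescaled by its maximum attainable value
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(cox_snell, 3), round(nagelkerke, 3))  # 0.099 0.385
```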
Performed classification results, with the cut value set to 0.5.
| Known machine failure | Predicted: NO Failure (0) | Predicted: Failure (1) | Percentage Correct |
|---|---|---|---|
| NO Failure (0) | 9635 | 26 | 99.7% |
| Failure (1) | 271 | 68 | 20.1% |
| Accuracy | | | 97.0% |
Variables in the equation with the predictor variables included.
| | B | S.E. | Wald | p-value | Exp(B) | 95% CI of Exp(B), lower | 95% CI of Exp(B), upper |
|---|---|---|---|---|---|---|---|
| V2 | 0.772 | 0.072 | 114.079 | ~0 | 2.165 | 1.879 | 2.495 |
| V3 | −0.743 | 0.096 | 59.442 | ~0 | 0.476 | 0.394 | 0.575 |
| V4 | 0.012 | 0.001 | 473.247 | ~0 | 1.012 | 1.011 | 1.013 |
| V5 | 0.281 | 0.011 | 599.492 | ~0 | 1.324 | 1.295 | 1.354 |
| V6 | 0.013 | 0.001 | 138.315 | ~0 | 1.013 | 1.011 | 1.016 |
| Constant | −36.69 | 14.641 | 6.278 | 0.012 | 0 | | |
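The columns of this table are linked by simple formulas, which gives a quick consistency check. The sketch below reads the first two numeric columns as the coefficient and its standard error, and assumes a 95% confidence level (z = 1.96); the function name `wald_summary` is hypothetical. Results match the table only approximately, because the tabulated inputs are rounded.

```python
import math


def wald_summary(b: float, se: float, z: float = 1.96):
    """Return (Wald statistic, p-value, odds ratio, (CI lower, CI upper))."""
    wald = (b / se) ** 2
    # Upper tail of a chi-square with 1 df, written with the error function.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    odds_ratio = math.exp(b)
    ci = (math.exp(b - z * se), math.exp(b + z * se))  # z=1.96 assumes 95%
    return wald, p_value, odds_ratio, ci


# First predictor row and the Constant row of the table above.
wald, p, odds_ratio, (ci_low, ci_high) = wald_summary(0.772, 0.072)
print(round(odds_ratio, 3), round(ci_low, 3), round(ci_high, 3))

_, p_const, _, _ = wald_summary(-36.69, 14.641)
print(round(p_const, 3))  # 0.012
```

An odds ratio whose confidence interval excludes 1 indicates that the predictor changes the odds of failure at the chosen confidence level.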
Performed classification results from using the method with the cut value 0.4.
| Known machine failure | Predicted: NO Failure (0) | Predicted: Failure (1) | Percentage Correct |
|---|---|---|---|
| NO Failure (0) | 9609 | 52 | 99.5% |
| Failure (1) | 243 | 96 | 28.3% |
| Accuracy | | | 97.1% |
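Several of the quality indicators named in the abstract can be derived directly from the four cells of a classification table. The following sketch uses the standard definitions and applies them to the two tables reported for cut values 0.5 and 0.4; the function name `classification_metrics` is hypothetical, and the code is not taken from the paper.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute common indicators from the cells of a 2x2 classification table."""
    n = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)   # share of known failures that were predicted
    specificity = tn / (tn + fp)   # share of known non-failures that were predicted
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    tss = sensitivity + specificity - 1   # true skill statistic
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "tss": tss,
        "kappa": kappa,
    }


# Cells in the order (TN, FP, FN, TP), keyed by cut value.
tables = {0.5: (9635, 26, 271, 68), 0.4: (9609, 52, 243, 96)}
for cut, cells in tables.items():
    metrics = classification_metrics(*cells)
    print(cut, {name: round(value, 3) for name, value in metrics.items()})
```

With a failure rate of about 3.4%, accuracy is almost the same for both cut values, while sensitivity and TSS show the gain from lowering the cut value more clearly.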