Prediction and evaluation of healthy and unhealthy status of COVID-19 patients using wearable device prototype data.

Shaik Asif Hussain, Nizar Al Bassam, Amer Zayegh, Sana Al Ghawi.

Abstract

The seriousness of the COVID-19 pandemic is making the whole world suffer because of inefficient medication and vaccines. The prediction analysis in this article is carried out with a dataset downloaded from an application programming interface (API) designed explicitly for quarantined COVID-19 patients. The measured data is collected from a wearable device used by quarantined healthy and unhealthy patients. The wearable device provides timely data on temperature, heart rate, SpO2 (blood oxygen saturation), and blood pressure to alert the medical authorities and support better diagnosis and treatment. The dataset contains 1085 patients with eight features, representing 490 COVID-19 infected and 595 normal cases. The work considers heart rate, temperature, SpO2, bpm, and health status as parameters. Furthermore, the real-time data collected can be used to predict the health status of patients as infected or non-infected from the measured parameters. A random forest classifier, together with linear and polynomial regression, is used to train and validate the COVID-19 patient data. Google Colab, an integrated development environment with built-in Python and Jupyter notebook support, was used with scikit-learn version 0.22.1 to test the work on cloud coding tools. The dataset is split in an 80%/20% ratio for training and testing to evaluate accuracy and avoid overfitting the model. This analysis could help medical authorities and governmental agencies of every country respond in a timely manner and reduce the spread of the disease.
•The measured data provide a comprehensive mapping of disease symptoms to predict the health status. Authorities can restrict virus transmission and take the necessary steps to control, mitigate, and manage the disease.
•Scientific research benefits from Artificial Intelligence (AI) in tackling the hurdles of analyzing disease diagnosis.
•The diagnosis results for disease symptoms can identify the severity of a patient's condition, to monitor and manage the difficulties caused by the outbreak.
© 2022 The Author(s). Published by Elsevier B.V.

Keywords:  AI model; Dataset; Healthcare; Pandemic; Quarantine; Wearable electronic device

Year:  2022        PMID: 35036334      PMCID: PMC8743393          DOI: 10.1016/j.mex.2022.101618

Source DB:  PubMed          Journal:  MethodsX        ISSN: 2215-0161


Specifications table

Methodology and data

The data mining classification method used is the Random Forest machine learning algorithm. Generic machine learning is employed to build a diagnosis model for COVID-19 patient symptoms; the steps involve support vector machine, decision tree, Random Forest, and logistic regression models for processing the diagnosis data to detect COVID-19 cases (Fig. 3). The Random Forest algorithm is a classifier built to diagnose the disease from the signs and symptoms of COVID-19 patients [8]. Fig. 1 shows the design flow of an AI project that builds a model from all available data and gives insight into the health status of COVID-19 patients.
Fig. 3

Dataset modeling, classification, and prediction.

Fig. 1

Shows the RF model classification.

Data descriptive and statistics

The dataset contains four measured values taken from a wearable device fitted with individual sensors for temperature, blood pressure, heart rate, and SpO2, as given in Table 1. It includes 1085 patients with eight features and a roughly balanced class proportion (Table 3). The dataset is downloaded through the web platform in .CSV, PDF, or Excel format and consists of 8 columns and 1085 rows [12]. The source file is a collection of data from the API link ProjectC (c19data.info) (Table 4). The proposed work has also been tested and implemented in the Anaconda tool (AEN 4.1 version) for data analysis [1].
Table 1

The parameters in the dataset.

| Data parameters | Description | Attributes |
| --- | --- | --- |
| Gender | Patient gender is an attribute primary spectrum of Health care | Male or female |
| Age | Patient's age is major influence associated to determine the health care | Less than 80 |
| Heart Rate | Pulse defines heart beats per minute as either too fast or too slow | < 100 |
| Temperature | Body temperature in human to evaluate person's health | <= 37 |
| SpO2 Saturation | It measures the percentage of blood oxygen content and arterial saturation | 96–100% |
| Blood pressure | Measures the blood pressure in the circulatory system | > 95 |

Table 3

Data Columns and types with count (total 8 columns).

| # | Column | Non-Null count | Dtype |
| --- | --- | --- | --- |
| 0 | Id | 1085 non-null | int64 |
| 1 | gender | 902 non-null | object |
| 2 | Age | 843 non-null | float64 |
| 3 | Heart_rate | 1085 non-null | int64 |
| 4 | Temperature | 1085 non-null | float64 |
| 5 | SPO2_saturation | 1085 non-null | float64 |
| 6 | Bpm | 1085 non-null | int64 |
| 7 | Health_status | 1085 non-null | object |

Dtypes: float64(3), int64(3), object(2); Memory Usage: 67.9+ kB.

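Table 4

Shows the dataset file with all the data included.

| S. No. | id | Gender | Age | Heart_rate | Temperature | SpO2 Saturation | bpm | Health_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Male | 66.0 | 70 | 38.6 | 88.0 | 75 | Infected |
| 1 | 2 | Female | 56.0 | 74 | 39.6 | 88.0 | 70 | Infected |
| 2 | 3 | Male | 46.0 | 82 | 37.2 | 98.0 | 83 | Non Infected |
| 3 | 4 | Female | 60.0 | 90 | 38.6 | 98.0 | 75 | Non Infected |
| 4 | 5 | Male | 58.0 | 72 | 39.6 | 93.0 | 78 | Infected |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1080 | 1081 | NaN | 24.0 | 110 | 38.0 | 30.0 | 72 | Infected |
| 1081 | 1082 | NaN | 35.0 | 110 | 38.0 | 30.0 | 74 | Infected |
| 1082 | 1083 | Male | NaN | 110 | 38.0 | 30.0 | 68 | Infected |
| 1083 | 1084 | Male | NaN | 110 | 38.0 | 30.0 | 67 | Infected |
| 1084 | 1085 | Male | 70.0 | 110 | 38.0 | 30.0 | 70 | Infected |
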
The dataset can be read easily as a supplementary file in .CSV format (Table 4). The data is updated and stored through the API link given above. The Random Forest algorithm is composed of multiple decision trees and uses supervised learning to perform both regression and classification (Fig. 4). The algorithm is an ensemble model of decision trees, with nodes and leaves, that classifies unlabeled data [6]. The proposed work uses numerical data that includes attributes irrelevant to prediction, drawn from patient ID, gender, age, heart rate, temperature, SpO2 saturation, and blood pressure [4]. Among these attributes, the informative data values are selected to predict the health status and probability of infection [3]. The algorithm takes the sample dataset of COVID-19 patients and associates a set of training samples with the selected features (Tables 2–10). The data classification is carried out with real-time measurements collected from different patients [13]; the class label, commonly known as a definite response, is the output Y predicted from the input variables X (Table 8). In effect, the model captures the relationship between the response and the predictors [4]. The background classification is carried out with nearest neighbors classifiers to obtain the linear model classification (Table 6).
Fig. 4

The design flow model of machine learning for COVID-19 dataset.

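Table 2

Shows the dataset shape for first five rows from the loaded dataset.

| S.No. | Patient ID | Gender | Age | Heart_rate | Temperature | SpO2 Saturation | BPM | Health_Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Male | 66.0 | 70 | 38.6 | 88.0 | 75 | Infected |
| 1 | 2 | Female | 56.0 | 74 | 39.6 | 88.0 | 70 | Infected |
| 2 | 3 | Male | 46.0 | 82 | 37.2 | 98.0 | 83 | Non-Infected |
| 3 | 4 | Female | 60.0 | 90 | 38.6 | 98.0 | 75 | Non-Infected |
| 4 | 5 | Male | 58.0 | 72 | 39.6 | 93.0 | 78 | Infected |
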
Table 5

Shows the standard statistics calculated for the considered data.

| Statistic | id | age | Heart_rate | Temperature | SpO2 Saturation | bpm |
| --- | --- | --- | --- | --- | --- | --- |
| Count | 1085.000000 | 843.000000 | 1085.000000 | 1085.000000 | 1085.000000 | 1085.000000 |
| Mean | 543.000000 | 49.483689 | 89.812903 | 38.562488 | 66.707465 | 71.221198 |
| std | 313.356825 | 18.255334 | 19.685747 | 4.592419 | 30.251069 | 13.148559 |
| Min | 1.000000 | 0.250000 | 47.000000 | 36.000000 | 20.000000 | 44.000000 |
| 25% | 272.000000 | 35.000000 | 72.000000 | 38.000000 | 30.000000 | 59.000000 |
| 50% | 543.000000 | 51.000000 | 91.000000 | 38.100000 | 82.000000 | 72.000000 |
| 75% | 814.000000 | 64.000000 | 110.000000 | 38.500000 | 87.300000 | 81.000000 |
| max | 1085.000000 | 96.000000 | 120.000000 | 95.000000 | 340.000000 | 109.000000 |

Table 6

Shows the correlation coefficient for the dataset.

| | id | age | Heart_rate | temperature | SpO2 Saturation | bpm |
| --- | --- | --- | --- | --- | --- | --- |
| ID | 1.000000 | −0.033531 | 0.721335 | −0.082765 | −0.558897 | 0.001511 |
| Age | −0.033531 | 1.000000 | 0.083925 | 0.091438 | 0.033087 | 0.061741 |
| Heart_rate | 0.721335 | 0.083925 | 1.000000 | −0.028797 | −0.235919 | 0.284245 |
| Temperature | −0.082765 | 0.091438 | −0.028797 | 1.000000 | 0.054208 | 0.003302 |
| SPO2 Saturation | −0.558897 | 0.033087 | −0.235919 | 0.054208 | 1.000000 | 0.079131 |
| bpm | 0.001511 | 0.061741 | 0.284245 | 0.003302 | 0.079131 | 1.000000 |

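
The pairwise coefficients in Table 6 are of the kind that pandas returns from DataFrame.corr, which computes the Pearson coefficient by default. The following is a minimal sketch of that calculation, assuming the Pearson form; the function name `pearson` and the symbols `x`, `y`, and `m` are illustrative choices, not taken from the article, and the input values are the five heart rate and temperature readings shown in Table 2.

```python
import math

# First five rows of the dataset as listed in Table 2 (two columns only).
heart_rate = [70, 74, 82, 90, 72]
temperature = [38.6, 39.6, 37.2, 38.6, 39.6]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    m = len(x)                      # number of observations
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_manual = pearson(heart_rate, temperature)
print("heart rate vs temperature:", round(r_manual, 6))
```

With only five rows the value will differ from the full-dataset entry in Table 6; the sketch is meant to show the computation, not to reproduce the table.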
Table 7

Shows the criterion of parameters for train and test points.

| S. No. | Parameters | Infected (Non-Healthy) | Non-Infected (Healthy) |
| --- | --- | --- | --- |
| 1. | Temperature | T > 37 | T < 37 |
| 2. | Heartbeat variation | > 100 | < 100 |
| 3. | BPM | <= 94 | > 95 |
| 4. | SpO2 | 95–100% | < 94% |

Table 8

Dataset to measure Accuracy.

| Description | Parameters (X, Y) | Percentage |
| --- | --- | --- |
| Accuracy score | Y_test and Y-Predict | 0.9926470588235294 |
| Training score | X_Train and Y-Train | 0.968019680196802 |
| Testing score | X_train and X-Test | 0.9705882352941176 |

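Table 9

Training and testing data for randomized values for 813 rows x 4 Columns.

| Id: 813 rows x 4 Columns | Heart_rate | Temperature | SpO2_saturation | bpm |
| --- | --- | --- | --- | --- |
| 862 | 113 | 38.5 | 30.0 | 67 |
| 658 | 97 | 38.5 | 85.0 | 66 |
| 252 | 78 | 36.9 | 98.0 | 67 |
| 706 | 102 | 38.5 | 85.0 | 53 |
| 215 | 64 | 37.8 | 85.0 | 81 |
| 1033 | 110 | 38.0 | 30.0 | 75 |
| 763 | 109 | 38.5 | 87.3 | 82 |
| 835 | 112 | 38.5 | 30.0 | 77 |
| 559 | 70 | 37.6 | 30.0 | 57 |
| 684 | 95 | 38.5 | 85.0 | 94 |
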
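Table 10

Training and testing data for randomized values for 272 rows x 4 columns.

| Id: [272 rows x 4 columns] | Heart_rate | Temperature | SpO2_saturation | bpm |
| --- | --- | --- | --- | --- |
| 204 | 61 | 38.0 | 85.0 | 89 |
| 183 | 65 | 37.8 | 89.0 | 94 |
| 356 | 82 | 37.1 | 96.0 | 58 |
| 1069 | 118 | 38.0 | 30.0 | 86 |
| 272 | 85 | 38.0 | 90.0 | 70 |
| 255 | 87 | 38.0 | 98.0 | 76 |
| 495 | 57 | 38.1 | 30.0 | 57 |
| 319 | 71 | 38.1 | 85.0 | 74 |
| 493 | 62 | 38.1 | 55.0 | 56 |
| 144 | 77 | 39.6 | 82.0 | 84 |
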
This work uses supervised learning, with inputs and correct outputs, to model the dataset over time so that the outcome derived from the diagnostic devices has a sufficiently small error [10]. The model is a Random Forest classifier built with scikit-learn version 0.22.1 and Python version 3.7.5, and tested on Google Colab. Multi-class classification gives the best understanding of the measured performance, with one part of the data used as a training set and another as testing data [3]. The following steps explain the performance metric and splitting strategy, where the raw data is converted into a sequence that can be analyzed (Table 5). The proposed work has also been tested and implemented in the Anaconda tool (AEN 4.1 version) for data analysis [1].
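
The splitting strategy can be checked with simple arithmetic. The sketch below assumes the sizing rule used by scikit-learn's train_test_split for a fractional test size (the test count is rounded up and the remainder is used for training); the helper name `split_sizes` is hypothetical. For 1085 rows, a test fraction of 0.25 yields the 813 and 272 rows listed in Tables 9 and 10.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    The test count is rounded up and the remaining rows go to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


for fraction in (0.20, 0.25):
    n_train, n_test = split_sizes(1085, fraction)
    print(f"test_size={fraction}: {n_train} training rows, {n_test} test rows")
```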

Pseudo code for RF algorithm

1. From the total 'K' features, select the informative attributes as 'n' features, where n << K.
2. For the n features selected, calculate the best split point.
3. Split each node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until the required number of nodes is reached.
5. Repeat steps 1 to 4 to generate the n trees that build the Random Forest model.

Dataset classification

The Random Forest algorithm is chosen as the best among the classifiers because it takes very little time to train and resists overfitting [2]. Its other significant feature is the accuracy with which it predicts the class-wise error rate (Figs. 2–5).
Fig. 2

Shows the process of classification with X and Y as actual and predicted values. (https://dsc-spidal.github.io/harp/docs/examples/rf/).

Fig. 5

The performance estimation and predictive model flow.

The tree classification of the RF model follows these steps.

1. A binary tree is grown to classify the data.
2. Each node separates the data into two daughter nodes.
3. Splitting is done based on the defined conditions or scaled values.
4. End nodes are known as terminal nodes.
5. The predicted class is the one chosen by the majority of trees.

The splitting criterion is the Gini criterion or the conditions defined. The Gini criterion is expressed in terms of pKL, the proportion of class K in the left node, and pKR, the proportion of class K in the right node.

Regression

The technique used to estimate the relationship between the independent features and the dependent feature is linear regression, which can easily forecast and predict the impact of the related variables [5].
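
The Gini criterion named in the tree classification steps can be illustrated with a short calculation. This is a minimal sketch assuming the standard size-weighted Gini impurity of a split, built from the class proportions in the left and right daughter nodes; the function names `gini_impurity` and `split_gini` and the toy labels are hypothetical and are not taken from the article.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: sum over classes of p_k * (1 - p_k)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum((c / total) * (1 - c / total) for c in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left_labels) + (n_right / n) * gini_impurity(right_labels)


# A split that separates the classes perfectly versus one that mixes them.
pure = split_gini(["Infected"] * 3, ["Non Infected"] * 2)
mixed = split_gini(["Infected", "Non Infected"], ["Infected", "Non Infected", "Infected"])
print("pure split:", pure)
print("mixed split:", mixed)
```

A tree-growing routine would evaluate this quantity for each candidate threshold and keep the split with the lowest value.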

Algorithm procedure

The Random Forest algorithm extracts subsamples from the given dataset to form the ensemble datasets (Table 7). The dataset contains eight features, four of which are relevant attributes with a meaningful relationship to the outcome. The algorithm works in two phases: random bootstrap sampling and decision tree creation. Together, these phases classify the result for the prediction. In the first phase, bootstrap sampling draws the samples used to fit f1(x), f2(x), ..., fM(x), which are combined into f(x) by model averaging. The second phase defines the criteria for splitting the trees into daughter nodes and implements a simple vote [7]. This work considers a mathematical and AI approach for the real-time dataset of COVID-19 patients to determine the current state of infection from SpO2 saturation, temperature, heartbeat, and blood pressure values [9]. The current health state, trained and tested from the dataset, gives a data-driven model to monitor and forecast the health condition of different patients during the pandemic [11].
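
The two phases described above, bootstrap sampling followed by tree creation and a simple vote, can be sketched by hand with scikit-learn decision trees. This is an illustrative sketch only: the data is synthetic (not the study dataset), the helper names `fit_forest` and `predict_forest` are hypothetical, and the choice of 25 trees is an assumption. In practice RandomForestClassifier performs these steps internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for four sensor features (NOT the study data).
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)


def fit_forest(X, y, n_trees=25):
    """Phase 1: draw bootstrap samples; phase 2: grow one tree per sample."""
    trees = []
    n = len(X)
    for seed in range(n_trees):
        rows = rng.integers(0, n, size=n)  # sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
        trees.append(tree.fit(X[rows], y[rows]))
    return trees


def predict_forest(trees, X):
    """Combine the per-tree predictions f_1(x) ... f_M(x) by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (M, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


trees = fit_forest(X, y)
pred = predict_forest(trees, X)
print("training accuracy of the hand-rolled forest:", (pred == y).mean())
```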

Illustrative Pseudo code with python programming

    # Importing libraries
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Show the result of every expression in a notebook cell
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

    # Load dataset from your local drive
    DATASET_LOC = "/path/downloads/covid-19-26.csv"
    df = pd.read_csv(DATASET_LOC)
    df.index  # RangeIndex(start=0, stop=1085, step=1)

    # Correlate all the numeric attributes
    correlation = df.select_dtypes("number").corr()

    # Inputs are columns 3-6 (sensor readings); the target is column 7 (health status)
    X = df.iloc[:, 3:7]
    y = df.iloc[:, 7]

    # Create training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Fit the model
    dcf = RandomForestClassifier()
    dcf.fit(X_train, y_train)

    # Inference on the held-out data
    y_predict = dcf.predict(X_test)

    # Accuracy check and stats for inference
    accuracy_score(y_test, y_predict)
    dcf.score(X_train, y_train)
    dcf.score(X_test, y_test)

To implement and understand the work carried out, the following steps are defined. Load the dataset in Google Colab or Visual Studio Code (Table 4). Add the proposed work in the Anaconda tool (AEN 4.1 version) for data analysis (Table 5). Once the dataset is loaded, the first five rows of the data frame are displayed in the software tool used. The command used to display the five rows is df.head() (Table 2). The dataset shape is obtained with a print statement, giving Dataset Shape: (1085, 8).
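
The loading steps can be tried without the original file. The sketch below types in the five rows shown in Table 2 by hand, so it only illustrates the commands; the column spellings are an assumption based on Tables 2 and 3 and may differ from the headers in the actual .CSV file.

```python
import pandas as pd

# The five rows shown in Table 2, entered by hand for illustration.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "gender": ["Male", "Female", "Male", "Female", "Male"],
        "age": [66.0, 56.0, 46.0, 60.0, 58.0],
        "Heart_rate": [70, 74, 82, 90, 72],
        "Temperature": [38.6, 39.6, 37.2, 38.6, 39.6],
        "SPO2_saturation": [88.0, 88.0, 98.0, 98.0, 93.0],
        "bpm": [75, 70, 83, 75, 78],
        "Health_status": ["Infected", "Infected", "Non Infected", "Non Infected", "Infected"],
    }
)

print(df.head())                   # first five rows, as in Table 2
print("Dataset Shape:", df.shape)  # (5, 8) here; (1085, 8) for the full file
print(df.dtypes)                   # column types, compare with Table 3
```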

Data wrangling, collection, and cleaning

The raw data must be prepared before it can support meaningful analytics and train a machine learning model. The data, stored in .CSV (comma-separated values) file format, holds the relevant attributes collected for patients: age and gender, together with the signs and symptoms of SpO2 saturation, heart rate, blood pressure, and temperature (Fig. 6). The data cleaning step removes missing values and unwanted characters from the data. In the code, df denotes the data frame; the null values in each column are found and summed so that data manipulation operations can be performed.
Fig. 6

The performance estimation and predictive model flow.
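
The null-counting part of the data cleaning step can be sketched with pandas. The rows below are the last five rows of Table 4, which contain the missing gender and age values; dropping incomplete rows is shown as one possible cleaning choice and is an assumption, since the article does not state how the missing values were handled.

```python
import pandas as pd

# Last five rows of Table 4 (three columns only), including missing values.
tail = pd.DataFrame(
    {
        "gender": [None, None, "Male", "Male", "Male"],
        "age": [24.0, 35.0, None, None, 70.0],
        "Heart_rate": [110, 110, 110, 110, 110],
    }
)

null_counts = tail.isnull().sum()  # number of missing values per column
print(null_counts)

# One possible cleaning choice (an assumption): drop rows with any missing value.
cleaned = tail.dropna()
print(len(cleaned), "complete row(s) remain")
```

According to Table 3, only the gender and age columns contain nulls, so the four sensor columns used as model inputs are complete.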

The correlation coefficient represents the relationship between two variables, here between the dependent and independent variables. The features for each attribute are shown in separate columns to define the variables in the dataset (Table 6). This step avoids false repetition of the values. Eq. 1 is written in terms of the first and second variable values, with m denoting the number of observations. When a cell contains multiple lines, the interactive shell setting displays the result of each of them.

In our dataset, the relevant features from columns 3–7 are considered, with x defining the input variables and y the predicted outputs. The head represents the first five rows of x and y (Table 2). The dataset is then split into training and test sets based on the conditions. This step maps the data into an optimal format for selecting a training set to process the data together, known as feature transformation.

Splitting data into training and testing: the scikit-learn function separates the train and test data from the source dataset by specifying the test size and train size (Table 10). The model is fitted based on the parameters assigned in the random forest model. The model specifies parameters such as features per node, number of trees, maximum tree depth, RF predictor, and confusion matrix, and sets the best-fit model for the random forest classifier. In this step, the algorithm is trained for evaluation to ensure proper testing. The data is split with 80% for training and 20% for testing to refine and optimize the model over time (Table 9).

The model is evaluated on the dataset to measure accuracy as a binary classifier (Table 8). The metrics are the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The overall accuracy is the number of elements assigned to their actual class divided by the total number of elements, that is, Accuracy = (TP + TN) / (TP + TN + FP + FN).
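
The accuracy measure described above is the share of correctly classified cases among all cases. The following minimal sketch assumes the usual definition from true and false positive and negative counts; the function name `accuracy` and the individual counts are hypothetical, chosen only so that they sum to a 272-row test set.

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy: correctly classified cases divided by all cases."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts that sum to 272 test rows.
print(accuracy(tp=120, tn=150, fp=1, fn=1))

# 270 correct predictions out of 272 give the accuracy score listed in Table 8.
print(270 / 272)
```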
Model validation: The training and testing data come from the same dataset, which is split so that the held-out portion tests the final model. Overfitting and underfitting describe how well the model generalizes to the data. In this work, overfitting applies to the training data, as the value obtained is too close to the outcome (Table 9). To assess the classification and its score, a confusion matrix is used. The matrix collects the actual and predicted health status in separate columns (Tables 2–4).
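
A confusion matrix of actual against predicted health status can be built with a few lines of standard-library Python. The labels below are made-up examples, not results from the study; scikit-learn's sklearn.metrics.confusion_matrix provides the same tabulation for real predictions.

```python
from collections import Counter

# Made-up actual and predicted labels for six patients.
actual = ["Infected", "Infected", "Non Infected", "Non Infected", "Infected", "Non Infected"]
predicted = ["Infected", "Non Infected", "Non Infected", "Non Infected", "Infected", "Infected"]

pairs = Counter(zip(actual, predicted))
labels = ["Infected", "Non Infected"]

# Rows are actual classes, columns are predicted classes.
matrix = [[pairs[(a, p)] for p in labels] for a in labels]
for label, row in zip(labels, matrix):
    print(f"{label:>12}: {row}")

# The diagonal holds the correctly classified cases.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("accuracy:", correct / len(actual))
```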

Conclusion

This simulation study analyzed the risk of COVID-19 disease progression using a random forest classifier algorithm. The eight features add to the uncertainty in forecasting the disease progression, which has brought health and financial crises. The model achieved an accuracy score of 99.26%, with training and testing scores reported separately. The 1085 samples used have total volatility to spillover during diversity. The comprehensive open-source framework of Google Colab was used together with Anaconda AEN 4.1, with designed efficiency to parameterize many body functions in artificial neural networks. The random forest classifier algorithm takes the sample dataset of COVID-19 patients and associates a set of training samples with the selected features. The Jupyter notebook software offers a real-time simulation in which the informative data values are determined to predict the health status and probability of infection. The data analysis predicts the classification and yields training and testing scores of 96.8% and 97.05%. This performance comes from a two-class classification process applied to the available data matrix. The confusion matrix collects the actual and predicted health status in separate columns.

CRediT authorship contribution statement

Shaik Asif Hussain: Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Writing – review & editing. Nizar Al Bassam: Conceptualization. Amer Zayegh: Software, Validation. Sana Al Ghawi: Writing – review & editing, Methodology.

Declaration of Competing Interest

This work was supported in part by Ministry of Higher Education Research and Innovation (MOHERI) formerly known as The Research council (TRC) of Oman under COVID-19 program Block Funding Agreement No TRC/CRP/MEC/COVID-19/20/09. The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.

Subject Area: Engineering
More specific subject area: Data Mining - Artificial Intelligence
Method name: Random Forest Classifier Algorithm used to train and test the data to predict the disease progression
Name and reference of original method: NA
Resource availability: https://doi.org/10.5281/zenodo.4766192 and http://www.c19data.info/index.php/admin/patients