Dong-Her Shih, Pai-Ling Shih, Ting-Wei Wu, Cheng-Jung Li, Ming-Hung Shih.
Abstract
Since December 2019, COVID-19 has been raging worldwide. To prevent the spread of COVID-19 infection, many countries have proposed epidemic prevention policies and quickly administered vaccines. However, facing a shortage of vaccines, the United States did not put forward effective epidemic prevention policies in time to keep the infection from expanding, and the epidemic in the United States became increasingly serious. Using data from "The COVID Tracking Project", this study collects medical indicators for each state in the United States from 2020 to 2021 and, after feature selection, clusters the states according to the severity of the epidemic. The accuracy of the cluster analysis is then verified through the confusion matrix of a classifier; the results show that Cascade K-means cluster analysis has the highest accuracy. This study also labels the three resulting clusters as high, medium, and low infection levels, so that policymakers can decide more objectively which states should be prioritized for vaccine allocation during a vaccine shortage. It is hoped that, if a similar epidemic occurs in the future, policymakers can use the analysis procedure of this study to allocate medical resources for epidemic prevention according to the severity of infection in each state and so prevent the spread of infection.
Keywords: COVID-19; classification validation; clustering analysis; machine learning; vaccine distribution
Year: 2022 PMID: 35885762 PMCID: PMC9323689 DOI: 10.3390/healthcare10071235
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Dataset.
| Item | Variables | Attribute | Item | Variables | Attribute |
|---|---|---|---|---|---|
| 1 | date | String | 24 | inIcuCumulative | Numerical |
| 2 | state | String | 25 | inIcuCurrently | Numerical |
| 3 | dataQualityGrade | String | 26 | onVentilatorCumulative | Numerical |
| 4 | positive | Numerical | 27 | onVentilatorCurrently | Numerical |
| 5 | positiveIncrease | Numerical | 28 | death | Numerical |
| 6 | probableCases | Numerical | 29 | deathIncrease | Numerical |
| 7 | positiveScore | Numerical | 30 | deathProbable | Numerical |
| 8 | positiveCasesViral | Numerical | 31 | deathConfirmed | Numerical |
| 9 | positiveTestsViral | Numerical | 32 | recovered | Numerical |
| 10 | positiveTestsPeopleAntibody | Numerical | 33 | totalTestResults | Numerical |
| 11 | positiveTestsAntibody | Numerical | 34 | totalTestResultsIncrease | Numerical |
| 12 | positiveTestsPeopleAntigen | Numerical | 35 | totalTestsViral | Numerical |
| 13 | positiveTestsAntigen | Numerical | 36 | totalTestsViralIncrease | Numerical |
| 14 | negative | Numerical | 37 | totalTestsPeopleViral | Numerical |
| 15 | negativeTestsViral | Numerical | 38 | totalTestsPeopleViralIncrease | Numerical |
| 16 | negativeTestsPeopleAntibody | Numerical | 39 | totalTestEncountersViral | Numerical |
| 17 | negativeTestsAntibody | Numerical | 40 | totalTestEncountersViralIncrease | Numerical |
| 18 | negativeIncrease | Numerical | 41 | totalTestsAntigen | Numerical |
| 19 | pending | Numerical | 42 | totalTestsPeopleAntigen | Numerical |
| 20 | hospitalized | Numerical | 43 | totalTestsAntibody | Numerical |
| 21 | hospitalizedIncrease | Numerical | 44 | totalTestsPeopleAntibody | Numerical |
| 22 | hospitalizedCumulative | Numerical | | | |
| 23 | hospitalizedCurrently | Numerical | | | |
Figure 1. Clustering analysis and classification scenario.
Comparison of different clustering techniques.
| Category | Hierarchical | Density-Based | Graph-Based | Partitioning |
|---|---|---|---|---|
| Based on | Linkage methods | Density accessibility | Graph theory | Mean Centroid |
| Type of Data | Numerical | Numerical | Mixed data | Numerical |
| Pros | Easy to implement | Finds clusters of arbitrary shapes and sizes | Performs well with complex shapes of data | Easy to implement |
| Cons | Fails on larger data sets | Does not work well with high-dimensional data | Can be costly to compute | Unable to handle noisy data and outliers |
Figure 2. WEKA feature selection process.
Figure 3. Clustering and classification validation with WEKA.
Dataset descriptive statistics.
| Data Field | Minimum | Maximum | Mean | Standard Deviation |
|---|---|---|---|---|
| death | 3.563 | 20,146.993 | 3012.726 | 4063.720 |
| deathConfirmed | 0.000 | 11,873.819 | 1612.264 | 2564.015 |
| deathIncrease | 0.229 | 139.083 | 23.802 | 28.186 |
| deathProbable | 0.000 | 1557.983 | 116.001 | 241.290 |
| hospitalized | 0.000 | 74,908.536 | 6795.447 | 12,375.868 |
| hospitalizedCumulative | 0.000 | 74,908.536 | 6795.447 | 12,375.868 |
| hospitalizedCurrently | 0.000 | 5706.697 | 1016.505 | 1178.133 |
| hospitalizedIncrease | −0.868 | 574.361 | 52.537 | 93.318 |
| inIcuCumulative | 0.000 | 4167.848 | 374.140 | 935.623 |
| inIcuCurrently | 0.000 | 1594.500 | 198.782 | 310.440 |
| negative | 3193.583 | 5,676,868.213 | 1,217,574.982 | 1,393,361.630 |
| negativeIncrease | 225.347 | 66,790.590 | 10,430.280 | 13,289.475 |
| negativeTestsAntibody | 0.000 | 274,785.535 | 9939.138 | 42,390.124 |
| negativeTestsPeopleAntibody | 0.000 | 307,446.436 | 9400.504 | 44,786.195 |
| negativeTestsViral | 0.000 | 5,069,123.190 | 376,163.876 | 913,586.361 |
| onVentilatorCumulative | 0.000 | 1210.757 | 44.743 | 189.114 |
| onVentilatorCurrently | 0.000 | 383.805 | 77.040 | 115.523 |
| positive | 290.354 | 689,808.865 | 126,025.819 | 137,216.541 |
| positiveCasesViral | 0.000 | 644,108.814 | 99,837.740 | 122,787.805 |
| positiveIncrease | 16.833 | 6633.789 | 1390.662 | 1405.567 |
| positiveScore | 0.000 | 0.000 | 0.000 | 0.000 |
| positiveTestsAntibody | 0.000 | 32,553.559 | 2128.500 | 6739.150 |
| positiveTestsAntigen | 0.000 | 28,745.608 | 1748.258 | 5172.510 |
| positiveTestsPeopleAntibody | 0.000 | 30,843.331 | 969.265 | 4476.315 |
| positiveTestsPeopleAntigen | 0.000 | 19,372.812 | 912.666 | 3418.733 |
| positiveTestsViral | 0.000 | 746,688.084 | 69,991.507 | 149,301.079 |
| recovered | 0.000 | 548,376.917 | 56,375.980 | 91,104.014 |
| totalTestEncountersViral | 0.000 | 6,252,107.282 | 436,938.962 | 1,266,757.976 |
| totalTestEncountersViralIncrease | 0.000 | 70,634.798 | 4496.100 | 13,115.623 |
| totalTestResults | 3643.785 | 6,252,107.282 | 1,575,543.098 | 1,650,858.474 |
| totalTestResultsIncrease | 250.514 | 70,634.798 | 14,555.709 | 15,895.585 |
| totalTestsAntibody | 0.000 | 336,182.488 | 33,358.436 | 81,366.549 |
| totalTestsAntigen | 0.000 | 329,705.523 | 23,679.767 | 57,553.140 |
| totalTestsPeopleAntibody | 0.000 | 338,396.716 | 13,807.533 | 51,122.147 |
| totalTestsPeopleAntigen | 0.000 | 121,896.261 | 6203.051 | 21,180.165 |
| totalTestsPeopleViral | 0.000 | 3,939,157.669 | 424,758.638 | 718,630.307 |
| totalTestsPeopleViralIncrease | −251.653 | 31,530.365 | 3374.921 | 5665.766 |
| totalTestsViral | 0.000 | 5,972,478.403 | 1,244,215.985 | 1,591,549.099 |
| totalTestsViralIncrease | 0.000 | 52,143.061 | 11,304.525 | 14,366.747 |
PCA feature selection results.
| Feature group | Pc1 | Pc2 | Pc3 | Pc4 | Pc5 | Pc6 | Pc7 | Pc8 | Pc9 | Pc10 | Pc11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Variation | 15.89 | 6.05 | 4.69 | 2.64 | 1.90 | 1.38 | 1.11 | 0.92 | 0.64 | 0.54 | 0.49 |
| Variation Percentage | 0.42 | 0.16 | 0.12 | 0.07 | 0.05 | 0.04 | 0.03 | 0.02 | 0.02 | 0.01 | 0.01 |
| Cumulative contribution ratio | 0.42 | 0.58 | 0.70 | 0.77 | 0.82 | 0.85 | 0.88 | 0.91 | 0.93 | 0.94 | 0.96 |
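The "Variation Percentage" and "Cumulative contribution ratio" rows follow from the "Variation" row by dividing each value by the total variance and accumulating. The sketch below is only an illustration under an assumption: it treats the "Variation" values as eigenvalues and takes the total variance to be 38, a figure inferred from 15.89 / 0.42 ≈ 38 and not stated in the source.

```python
from itertools import accumulate

# "Variation" row of the PCA table (Pc1..Pc11).
eigenvalues = [15.89, 6.05, 4.69, 2.64, 1.90, 1.38, 1.11, 0.92, 0.64, 0.54, 0.49]

# ASSUMPTION: total variance of 38, inferred from 15.89 / 0.42; not given in the source.
TOTAL_VARIANCE = 38.0

# Share of total variance explained by each component.
ratios = [ev / TOTAL_VARIANCE for ev in eigenvalues]

# Running sum of the shares: the cumulative contribution ratio.
cumulative = list(accumulate(ratios))

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"Pc{i}: ratio={r:.2f} cumulative={c:.2f}")
```

Under this assumption the per-component ratios round to the values in the table, and the cumulative figures agree with the published row to within about 0.01, which is presumably a rounding difference.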
Feature selection sorting.
| Rank | PCA | IG | GR | Average Rank |
|---|---|---|---|---|
| 1 | A30 | A2 | A39 | 2.7302 |
| 2 | A19 | A19 | A11 | 2.7302 |
| 3 | A21 | A30 | A13 | 2.7302 |
| 4 | A31 | A31 | A16 | 2.7302 |
| 5 | A8 | A13 | A18 | 2.7302 |
| 6 | A15 | A21 | A19 | 2.7302 |
| 7 | A24 | A12 | A38 | 2.7302 |
| 8 | A34 | A4 | A12 | 2.7302 |
| 9 | A14 | A8 | A9 | 2.7302 |
| 10 | A36 | A27 | A22 | 2.6814 |
| 11 | A9 | A38 | A8 | 2.4945 |
| 12 | A33 | A39 | A3 | 2.4945 |
| 13 | A6 | A20 | A4 | 2.4945 |
| 14 | A7 | A7 | A5 | 2.3984 |
| 15 | A23 | A6 | A6 | 2.3984 |
| 16 | A3 | A11 | A7 | 2.3269 |
| 17 | A5 | A9 | A21 | 2.3269 |
| 18 | A39 | A36 | A20 | 2.2745 |
| 19 | A29 | A18 | A2 | 1.9183 |
| 20 | A18 | A3 | A31 | 1.9183 |
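The IG and GR columns presumably stand for information gain and gain ratio, two entropy-based attribute rankings. The sketch below shows how the two scores differ on hypothetical toy data. It assumes discrete feature values (continuous indicators would have to be discretized first) and is not the WEKA implementation used in the study.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: two class labels and two candidate features.
labels = ["high", "high", "low", "low"]
binary_feature = ["x", "x", "y", "y"]   # separates the classes with two values
id_like_feature = [1, 2, 3, 4]          # separates them too, but with four values

ig_binary = information_gain(binary_feature, labels)
gr_binary = gain_ratio(binary_feature, labels)
ig_id = information_gain(id_like_feature, labels)
gr_id = gain_ratio(id_like_feature, labels)
```

Both toy features have the same information gain, but the gain ratio penalizes the feature that splits the data into many small groups, which is one reason the two rankings in a table like the one above can disagree.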
Clustering variable.
| Code | Variable | Definition |
|---|---|---|
| A3 | deathConfirmed | Number of confirmed deaths |
| A6 | hospitalized | Number of hospitalizations |
| A7 | hospitalizedCumulative | Cumulative hospitalizations |
| A8 | hospitalizedCurrently | Number of people currently hospitalized |
| A9 | hospitalizedIncrease | New hospitalizations |
| A18 | onVentilatorCurrently | Number of ventilators currently in use |
| A19 | positive | Number of confirmed cases |
| A21 | positiveIncrease | Number of new confirmed cases |
| A31 | totalTestResults | Total number of tests |
| A39 | totalTestsViralIncrease | Number of new PCR tests |
Clustering results.
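| Method | Group | Count | Cluster members |
|---|---|---|---|
| FS + K-Means Clustering | Cluster1 | 5 states | LA, … |
| FS + K-Means Clustering | Cluster2 | 5 states | IL, … |
| FS + K-Means Clustering | Cluster3 | 41 states | AK, AL, AR, AZ, CO, CT, DC, DE, FL, GA, GU, HI, IA, ID, IN, … |
| FS + Cascade K-Means Clustering | Cluster1 | 22 states | AL, AR, AZ, CO, FL, GA, IN, KY, MA, MD, MN, MS, NJ, NM, NY, OH, OK, SC, TN, UT, VA, WI |
| FS + Cascade K-Means Clustering | Cluster2 | 22 states | AK, CT, DC, DE, GU, HI, IA, ID, KS, ME, MT, ND, NE, NH, NV, OR, PR, RI, SD, VT, WA, WY |
| FS + Cascade K-Means Clustering | Cluster3 | 7 states | IL, LA, MI, MO, NC, PA, TX |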
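Cascade K-means, as provided in WEKA, is generally described as running K-means for a range of cluster counts and keeping the count with the best Calinski-Harabasz score. The sketch below illustrates that idea on hypothetical 2-D points. The farthest-point initialization, the range of k, and the toy data are all assumptions chosen to keep the example deterministic; this is not the WEKA implementation, nor the study's data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization (assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = centroid(members)
    return labels

def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion, each scaled by its df."""
    n = len(points)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, centroid(m)) for m in clusters.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: three well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

# "Cascade" step: score each candidate k and keep the best one.
scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
best_labels = kmeans(points, best_k)
```

On this toy data the score peaks at three clusters, one per group. On real state-level indicators the features would normally be standardized first so that large-valued columns such as test counts do not dominate the distances.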
ANOVA analysis results with K-means clustering.
| Source | Sum of Squares | df | p |
|---|---|---|---|
| deathConfirmed | 96,309,899.004 | 2 | 0.000 *** |
| hospitalized | 163,810,675.372 | 2 | 0.595 |
| hospitalizedCumulative | 163,810,675.372 | 2 | 0.595 |
| hospitalizedCurrently | 20,454,751.196 | 2 | 0.000 *** |
| hospitalizedIncrease | 10,452.301 | 2 | 0.558 |
| onVentilatorCurrently | 171,650.041 | 2 | 0.001 *** |
| positive | 297,933,527,997.266 | 2 | 0.000 *** |
| positiveIncrease | 33,214,641.966 | 2 | 0.000 *** |
| totalTestResults | 56,194,090,472,628.98 | 2 | 0.000 *** |
| totalTestsViral | 73,640,463,254,532.75 | 2 | 0.000 *** |
*** p < 0.001.
ANOVA analysis results with Cascade K-means clustering.
| Source | Sum of Squares | df | p |
|---|---|---|---|
| deathConfirmed | 92,467,765.941 | 2 | 0.000 *** |
| hospitalized | 2,542,735,930.524 | 2 | 0.000 *** |
| hospitalizedCumulative | 2,542,735,930.524 | 2 | 0.000 *** |
| hospitalizedCurrently | 28,731,039.605 | 2 | 0.000 *** |
| hospitalizedIncrease | 146,195.267 | 2 | 0.000 *** |
| onVentilatorCurrently | 182,801.937 | 2 | 0.000 *** |
| positive | 428,233,021,802.748 | 2 | 0.000 *** |
| positiveIncrease | 47,433,754.861 | 2 | 0.000 *** |
| totalTestResults | 62,516,522,292,791.39 | 2 | 0.000 *** |
| totalTestsViral | 53,777,136,225,256.51 | 2 | 0.000 *** |
*** p < 0.001.
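The df of 2 in both ANOVA tables is consistent with a between-group term for three clusters (df = k − 1). The sketch below shows how the between-group sum of squares and the F statistic of a one-way ANOVA are obtained for a single indicator. The data are hypothetical, and the p-value is left out because it needs the CDF of the F distribution (available, for example, through `scipy.stats.f_oneway`).

```python
def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared distance of group mean from grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared distance of each value from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_stat

# Hypothetical values of one indicator in three clusters (not data from the study).
clusters = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ss_b, df_b, ss_w, df_w, f_stat = one_way_anova(clusters)
```

A large F means the cluster means differ by much more than the spread inside each cluster, which is what a significant row in the tables above indicates for that indicator.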
Random forest classification validation for K-means clustering.
| Prediction Class \ Clustering Class | Cluster1 (a) | Cluster2 (b) | Cluster3 (c) |
|---|---|---|---|
| Cluster1 (a) | | 0 | 5 |
| Cluster2 (b) | 0 | | 4 |
| Cluster3 (c) | 0 | 0 | |
Random forest classification validation for Cascade K-means clustering.
| Prediction Class \ Clustering Class | Cluster1 (a) | Cluster2 (b) | Cluster3 (c) |
|---|---|---|---|
| Cluster1 (a) | | 1 | 0 |
| Cluster2 (b) | 0 | | 0 |
| Cluster3 (c) | 0 | 0 | |
Neural network classification validation for K-means clustering.
| Prediction Class \ Clustering Class | Cluster1 (a) | Cluster2 (b) | Cluster3 (c) |
|---|---|---|---|
| Cluster1 (a) | | 0 | 5 |
| Cluster2 (b) | 2 | | 1 |
| Cluster3 (c) | 0 | 2 | |
Neural network classification validation for Cascade K-means clustering.
| Prediction Class \ Clustering Class | Cluster1 (a) | Cluster2 (b) | Cluster3 (c) |
|---|---|---|---|
| Cluster1 (a) | | 3 | 0 |
| Cluster2 (b) | 0 | | 0 |
| Cluster3 (c) | 3 | 0 | |
Comprehensive comparison of two clustering methods.
| Metric | K-Means (RF) | K-Means (NN) | Cascade K-Means (RF) | Cascade K-Means (NN) |
|---|---|---|---|---|
| Accuracy | 0.8235 | 0.8039 | 0.9803 | 0.8823 |
| Precision | 1 | 0.911 | 1 | 0.938 |
| Recall | 0.824 | 0.872 | 0.98 | 0.938 |
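The accuracies in this table can be cross-checked against the four confusion matrices, since accuracy = (N − misclassified) / N. The sketch below assumes N = 51 (the cluster counts in the clustering results sum to 51 for both methods) and uses only the off-diagonal counts as they appear in the confusion matrices; the diagonal entries are not needed.

```python
# ASSUMPTION: 51 clustered units, from 5 + 5 + 41 (K-means) and 7 + 22 + 22 (Cascade K-means).
N_STATES = 51

# Off-diagonal (misclassified) counts read row by row from the four confusion matrices.
off_diagonal = {
    ("K-Means", "RF"): [0, 5, 0, 4, 0, 0],
    ("K-Means", "NN"): [0, 5, 2, 1, 0, 2],
    ("Cascade K-Means", "RF"): [1, 0, 0, 0, 0, 0],
    ("Cascade K-Means", "NN"): [3, 0, 0, 0, 3, 0],
}

# Accuracy is the share of units that were not misclassified.
accuracy = {
    key: (N_STATES - sum(errors)) / N_STATES
    for key, errors in off_diagonal.items()
}

for (clustering, validator), value in accuracy.items():
    print(f"{clustering} + {validator}: {value:.4f}")
```

The results (42/51, 41/51, 50/51, and 45/51) agree with the reported accuracies to within rounding; the reported 0.9803 and 0.8823 appear to be truncated rather than rounded.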
Figure 4. Cluster characteristics.
Cluster labeling.
| Severity | States |
|---|---|
| Low | AK, CT, DC, DE, GU, HI, IA, ID, KS, ME, MT, ND, NE, NH, NV, OR, PR, RI, SD, VT, WA, WY |
| Medium | AL, AR, AZ, CO, FL, GA, IN, KY, MA, MD, MN, MS, NJ, NM, NY, OH, OK, SC, TN, UT, VA, WI |
| High | IL, LA, MI, MO, NC, PA, TX |
Figure 5. Clustering results from Cascade K-means.
Figure 6. Influenza activity according to the CDC.