
Assessment of the dissimilarities of totally 186 countries and regions according to COVID-19 indicators at the end of March 2020.

Handan Ankarali¹, Unal Uslu², Seyit Ankarali³, Sengul Cangur⁴.

Abstract

Background: This study is aimed at evaluating the relationship between the number of days elapsed since a country's first case(s) of coronavirus disease 2019 (COVID-19), the total number of tests conducted, and outbreak indicators such as the total numbers of cases, deaths, and patients who recovered. The study compares COVID-19 indicators among countries and clusters them according to similarities in the indicators.
Methods: Descriptive statistics of the indicators were computed and the results were presented in figures and tables. A fuzzy c-means clustering algorithm was used to cluster/group the countries according to the similarities in the total numbers of patients who recovered, deaths, and active cases.
Results: The highest numbers of COVID-19 cases were found in Gibraltar, Spain, Switzerland, Liechtenstein, and Italy, each with about 1500 cases per million population. Spain and Italy had the highest total number of deaths, which were about 140 and 165 per million population, respectively. In Japan, where exposure to the causative virus was longer than in most other countries, the total number of deaths per million population was less than 0.5. According to cluster analysis, the total numbers of deaths, patients who recovered, and active cases were higher in Western countries, especially in central and southern European countries, which had the highest numbers when compared with other countries.
Conclusion: There may be various reasons for the differences between the clusters obtained by fuzzy c-means clustering. These include quarantine measures, climatic conditions, economic levels, health policies, and the duration of the fight against the outbreak.


Keywords: COVID-19; clustering; outbreak; total number of cases; total number of deaths


Year: 2022 PMID: 35991817 PMCID: PMC9356527 DOI: 10.4314/mmj.v34i2.2

Source DB: PubMed Journal: Malawi Med J ISSN: 1995-7262 Impact factor: 1.413

Introduction

Coronavirus disease 2019 (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a zoonotic virus that crossed species to infect human populations and was first identified in Wuhan, China. As with severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), both of which are human respiratory syndromes, COVID-19 often involves severe respiratory symptoms that can be fatal. The World Health Organization (WHO) first determined that the global risk of a COVID-19 pandemic was “very high” on 28 February 2020, subsequently declaring the outbreak to be a pandemic on 11 March 2020 [1]. At that time, COVID-19 had been detected in 81 countries, with 57 countries registering 10 cases or fewer. Around 12 March 2020, the centre of the pandemic moved from China to Europe; subsequently, the number of countries exposed to COVID-19 reached 186 by the end of March [2]. Because the outbreak has affected the world in many respects, a summary of the current situation is of particular importance, and identification of the similarities and differences between countries in terms of the measures being taken is crucial. The first objective of this study is to define the relationships between the outbreak indicators of the 34 countries that had reported the total number of tests conducted by the end of March 2020 and the duration of the fight against the outbreak. The second objective is to cluster a total of 186 countries and regions according to the outbreak indicators (i.e. the total number of patients who recovered per million population, the total number of deaths per million population, and the total number of active cases per million population) to make it easier to track the outbreak and to evaluate countries' policies related to the pandemic.

Methods

Study population

In this study, the data for the 34 countries that had reported the total number of tests conducted by the end of March 2020 were used for the first objective. The outbreak indicators obtained from a total of 186 countries and regions, as well as two ships, were analysed for the second objective.

Study design and data collection technique

This was a cross-sectional study. Data on the total number of tests conducted and the total number of cases in 33 countries were collected between 17 and 20 March 2020. In addition, data from Turkey were collected on 26 March 2020. The data were analysed according to the following indicators:

Total number of cases per total number of tests (%)
Total number of cases per million population

The population size and outbreak indicators, which included the confirmed cases, patients who recovered, deaths, and active cases per day in each country, were mined from open-access public databases on 29 March 2020 [3–5]. The ratio of the total number of cases to the total number of tests performed indicates how many people had positive results per 100 tests. In addition, the other indicators used were as follows:

Daily number of new cases
Total number of deaths
Total number of patients who recovered
Total number of active cases
Total number of critical cases

The total number of cases was defined as the total number of deaths plus the total number of patients who recovered plus the total number of active cases. The number of days elapsed between the date of the first reported case and 29 March 2020 was taken into account when we compared countries in terms of outbreak indicators. These days were then divided into ten periods at appropriate intervals, and the effects of these periods on the indicators were re-evaluated from a different perspective. The periods were as follows:

Period 1: 31 December 2019 to 15 January 2020
Period 2: 16–31 January 2020
Period 3: 1–7 February 2020
Period 4: 8–15 February 2020
Period 5: 16–20 February 2020
Period 6: 21–29 February 2020
Period 7: 1–7 March 2020
Period 8: 8–15 March 2020
Period 9: 16–21 March 2020
Period 10: 22–29 March 2020
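The indicator definitions and exposure periods above can be expressed in a few lines of Python. This is a minimal sketch and not the authors' code: the function names and the `PERIODS` table are illustrative assumptions, and only the formulas, the period boundaries, and the 29 March 2020 reference date are taken from the text.

```python
from datetime import date

# Period boundaries (inclusive) as listed in the text; the table name is an assumption.
PERIODS = [
    (date(2019, 12, 31), date(2020, 1, 15)),
    (date(2020, 1, 16), date(2020, 1, 31)),
    (date(2020, 2, 1), date(2020, 2, 7)),
    (date(2020, 2, 8), date(2020, 2, 15)),
    (date(2020, 2, 16), date(2020, 2, 20)),
    (date(2020, 2, 21), date(2020, 2, 29)),
    (date(2020, 3, 1), date(2020, 3, 7)),
    (date(2020, 3, 8), date(2020, 3, 15)),
    (date(2020, 3, 16), date(2020, 3, 21)),
    (date(2020, 3, 22), date(2020, 3, 29)),
]
REFERENCE_DATE = date(2020, 3, 29)  # date on which the indicators were collected


def positivity_percent(total_cases, total_tests):
    """Positive results per 100 tests."""
    return 100.0 * total_cases / total_tests


def per_million(count, population):
    """Scale a raw count to a rate per million population."""
    return 1_000_000 * count / population


def total_cases(deaths, recovered, active):
    """Total cases = deaths + recovered + active cases."""
    return deaths + recovered + active


def days_elapsed(first_case):
    """Days between the first reported case and the reference date."""
    return (REFERENCE_DATE - first_case).days


def period_of(first_case):
    """Return the period number (1 to 10) containing the first-case date."""
    for number, (start, end) in enumerate(PERIODS, start=1):
        if start <= first_case <= end:
            return number
    raise ValueError("first-case date falls outside the ten periods")
```

For a hypothetical country whose first case was reported on 30 January 2020, `period_of` returns period 2 and `days_elapsed` returns 59.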

Eligibility criteria

The countries selected for evaluation of the first objective are those that had reported the total number of tests conducted by the end of March 2020. Data from all countries reporting outbreak indicators published by the end of March 2020 were used to evaluate the second objective.

Ethical considerations

All the data were obtained from open-access public databases; these were Worldometer, the WHO database, and the Johns Hopkins University & Medicine Coronavirus Resource Center database [3–5]. Therefore, ethical approval was not required.

Statistical analysis

The descriptive values (the median, the 25th and 75th percentiles, the mode, and the minimum and maximum) of the outbreak indicators from the countries with outbreaks in the given periods were calculated. All figures for the first objective were drawn with the program Datawrapper [6]. The Kruskal-Wallis test followed by the Dunn post hoc test was used to compare the ten periods for the four outbreak indicators in Figures 5–8. The fuzzy c-means (FCM) clustering algorithm was used to cluster the countries by the total number of deaths, total number of patients who recovered, and total number of active cases per million population. All statistical analyses were done with IBM SPSS Statistics for Windows version 25.0 (IBM SPSS, Armonk, NY, USA) [7] and JASP 0.11 (JASP Team, Amsterdam, Netherlands) [8].
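To make the period comparison concrete, the following sketch computes the Kruskal-Wallis H statistic with the usual tie correction. It is an illustration only: the study ran the test in SPSS and JASP, the function names here are assumptions, and neither the p-value (H is normally referred to a chi-square distribution with k − 1 degrees of freedom) nor the Dunn post hoc step is implemented.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples (one per period).

    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction
```

Each inner list would hold one indicator (for example, active cases per million population) for the countries in one period.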

Figure 5

Total number of active cases per million population according to the period.

Figure 8

Number of critical cases per million population according to the period.


FCM clustering

Clustering or cluster analysis is an unsupervised data analysis method that is used to partition a set of records or objects into clusters with similar characteristics. Clusters are identified via similarity measures. Clustering involves assigning data points to clusters so that items in the same cluster are as similar as possible, while items belonging to different clusters are as dissimilar as possible. It is a desideratum that the within-cluster variance should be low and the between-cluster variance should be high in clustering. Fuzzy clustering (also referred to as “soft clustering” or “soft k-means”) is a form of clustering in which each data point can belong to more than one cluster [9]. Because some countries may be similar to more than one other country in terms of outbreak indicators, fuzzy clustering rather than hard clustering is the more appropriate approach. The FCM clustering algorithm is the most widely used partition-based clustering algorithm. FCM clustering with an automatically determined number of clusters can enhance detection accuracy; it uses the Euclidean distance measure [10]. The FCM clustering algorithm gives the best results for overlapped datasets and is comparatively better than k-means and hierarchical clustering algorithms [11]. The algorithm is an iterative clustering method that produces an optimal c partition by minimizing the weighted within-group sum of squared error objective function J [11]. A set of cluster validity indices is used to estimate the number of clusters in a set of datasets partitioned by several algorithms. R², the Akaike information criterion, the Bayesian information criterion, the within-cluster sum of squares, the Dunn index, the Calinski-Harabasz index, and the Silhouette score are used to validate the results obtained by the FCM clustering algorithm. All of these are internal cluster validity indices.
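The iterative scheme described above can be sketched in plain Python. This is a minimal illustration of the standard FCM updates, not the configuration used in the study: the fuzzifier `m = 2`, the random initialization, the stopping tolerance, and all function names are assumptions, since the text does not report these settings. The objective being minimized is J = Σᵢ Σⱼ uᵢⱼᵐ ‖xᵢ − cⱼ‖², where uᵢⱼ is the membership of point i in cluster j and cⱼ is the centre of cluster j.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centers(points, u, m):
    """Each centre is the mean of all points weighted by membership ** m."""
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(points[0]))
        ])
    return centers


def update_memberships(points, centers, m):
    """u_ij is proportional to (1 / d_ij^2) ** (1 / (m - 1)); rows sum to 1."""
    power = 1.0 / (m - 1)
    u = []
    for p in points:
        d2 = [sq_dist(p, c) for c in centers]
        if min(d2) == 0.0:
            # The point sits exactly on a centre: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in d2]
            u.append([h / sum(hits) for h in hits])
        else:
            inv = [(1.0 / d) ** power for d in d2]
            u.append([v / sum(inv) for v in inv])
    return u


def objective(points, centers, u, m):
    """Weighted within-group sum of squared errors, J."""
    return sum(
        u[i][j] ** m * sq_dist(p, c)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Alternate centre and membership updates until memberships stabilize."""
    rng = random.Random(seed)
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        u.append([v / sum(row) for v in row])
    for _ in range(max_iter):
        centers = update_centers(points, u, m)
        new_u = update_memberships(points, centers, m)
        shift = max(abs(a - b) for r, s in zip(u, new_u) for a, b in zip(r, s))
        u = new_u
        if shift < tol:
            break
    return update_centers(points, u, m), u
```

In the setting of this study, each point would be a country's triple of deaths, recovered patients, and active cases per million population. A hard label, where one is needed, can be taken as the cluster with the largest membership in each row of `u`.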
There are a few well-known measures, such as the Silhouette score, the Davies-Bouldin index, the Calinski-Harabasz index, and the Dunn index [12], but these alone are not enough to determine cluster quality; moreover, the very notion of “good clustering” is relative, depending on the point of view and the knowledge of the analyser. The Dunn index is a ratio-type index in which the separation is estimated by the nearest-neighbour distance between clusters and the cohesion is estimated by the maximum cluster diameter. Algorithms that produce clusters with a high Dunn index are more desirable. The Calinski-Harabasz index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters; the higher the score, the better the performance. The Silhouette score compares, for each data point, the distance to the centroid of the cluster it was assigned to with the distance to the closest centroid belonging to another cluster. This index is normalized, and a value close to 1 indicates a good result for any clustering being evaluated. The score is bounded between −1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
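Two of the validity indices described above are simple enough to compute directly. The sketch below uses the common textbook definitions and assumes hard labels (for FCM output, the cluster with the highest membership for each point); the function names are illustrative, and the software used in the study may differ in details such as normalization.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def group_by_label(points, labels):
    """Collect the points of each cluster into its own list."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return list(clusters.values())


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter.

    Undefined (division by zero) when every cluster holds a single point.
    """
    clusters = group_by_label(points, labels)
    separation = min(
        euclidean(p, q)
        for a in range(len(clusters))
        for b in range(a + 1, len(clusters))
        for p in clusters[a]
        for q in clusters[b]
    )
    diameter = max(euclidean(p, q) for c in clusters for p in c for q in c)
    return separation / diameter


def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion,
    each divided by its degrees of freedom (k - 1 and n - k)."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * euclidean(centroid(c), overall) ** 2 for c in clusters)
    within = sum(euclidean(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))
```

For both indices, larger values indicate more compact and better separated clusters, which is why they are typically compared across candidate numbers of clusters.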

Results

The relationships between the outbreak indicators and the total number of tests and the duration of the fight against the outbreak

Figure 1 illustrates the relationship between the total number of cases and the total number of tests for the 34 countries that had reported tests performed between 17 and 20 March 2020. In addition, the total number of cases per total number of tests is plotted against the number of days elapsed between the first reported cases and 29 March 2020 for each country in Figure 2; see also Table 1.

Figure 1

Relationship between the total number of tests reported and the total number of cases in different countries

Figure 2

Total number of cases per million population versus the number of days after the first case(s) in each country.

Table 1

Total number of cases per total number of tests in the countries studied

The results show that Australia, Russia, Bahrain, Poland, South Africa, South Korea, Taiwan, Vietnam, Hungary, and Thailand performed the highest number of tests per million population and had the lowest proportion of positive test results (<3%). In eight countries, the rate of positive cases per total number of tests is higher than 10%. Among these countries, Spain, Pakistan, and Italy have the highest proportions. Figure 3 shows the total number of cases against the number of days after the first case(s).

Figure 3

Total number of cases per million population versus the number of days after the first case(s).

Gibraltar, Spain, Switzerland, Liechtenstein, and Italy had the highest numbers, about 1500 cases per million population. The number of days elapsed between the first reported cases and 29 March 2020 was 59 in Spain and Italy and 33 in Switzerland. In many countries, however, the number of cases was less than 100 per million population. The relationship between the total number of deaths per million population and the number of days after the first case(s) is shown in Figure 4. Spain and Italy had the highest total number of deaths, which were about 140 and 165 per million population, respectively.

Figure 4

Total number of deaths per million population versus the number of days after the first case(s).

On the basis of when the first positive cases were reported in many countries, the number of days elapsed since the outbreak was divided into ten periods, as described in the methods section. Changes in the outbreak indicators in each country according to these periods are presented in Figures 5–10. Specific countries in each period are presented in Tables 2 and 3. The period in which maximum exposure occurred was 8–15 March 2020; 54% of countries saw their first case(s) before 8 March, and approximately 20% of first case(s) occurred after 15 March.

Figure 10

Total number of patients who recovered per million population according to the period

Table 2

Periods and the number of countries and ships

Table 3

Countries and ships in the periods determined by considering exposure times

Figure 9

Total number of cases per million population according to the period

Exposure periods are listed on the x-axis in Figures 5–10, from the longest exposure (period 1) to the shortest exposure (period 10). In the periods covering 31 December to 15 January, 8–15 February, and 1–29 March (periods 1, 4, 8, 9, and 10), the median number of active cases per million population was significantly lower than for the other periods (Figure 5 and Table 4; P<0.001). Other than that, no significant difference was found. From Figure 5 it can be seen that the highest number of active cases among the countries in the second period was in Italy, and in the sixth period the highest numbers were in Luxembourg and Switzerland.

Table 4

Descriptive values of the indicators in the ten periods according to the first case(s)

In Japan, where exposure to the virus was longer than in most other countries, the total number of deaths is very low (less than 0.5 per million population). The median number of deaths was significantly higher in periods 2, 3, and 6 (16–31 January, 1–7 February, and 21–29 February) than in the other periods (Figure 6 and Table 4; P<0.001). Italy and Spain have the highest numbers in period 2 and the Netherlands has the highest number in period 6.

Figure 6

Total number of deaths per million population according to the period.

The number of new cases in the countries that have been exposed the longest is quite low. The median number of new cases was significantly lower in periods 1, 4, 8, and 10 (31 December to 15 January, 8–15 February, 8–15 March, and 22–29 March). This was followed by the 16–21 March period, with a significantly higher number of cases than for the other periods (Figure 7 and Table 4; P<0.001). Figure 7 shows that among the countries that experienced outbreaks in the 16–31 January period (period 2), Spain, the UK, and Sweden have a significantly higher number of new cases than the other countries. In addition, among the countries exposed in the 1–7 February period, the number of new cases is highest in Belgium.

Figure 7

Number of new cases per million population according to the period

The number of critical cases is quite low in the countries that have been exposed the longest.

Discussion

The earliest countries to report COVID-19 cases after the outbreak in China were South Korea and Taiwan, but these countries have contained the outbreak with some success [13]. The rapid spread of COVID-19 has led many countries around the world to implement strict measures, and serious problems have started to emerge. To follow the course of the outbreak and to minimize problems, it is of great importance that accurate methods of data analysis should be used. In addition, many indicators and country-specific characteristics should be taken into consideration when one is comparing data from different countries [14]. There are many open-access databases comprising shared data relating to COVID-19 cases that can be used for this purpose [3–5]. In this study, two objectives were achieved. Firstly, the relationship between outbreak indicators (total number of cases, total number of deaths, and total number of patients who recovered) and the number of days after the index case, and also the total number of tests, was clarified. From Figures 2–4, it can be seen that, relative to the total number of tests conducted, the number of positive cases and the total number of deaths in Italy and Spain are very high. These numbers have negatively affected the responsiveness of the health systems in those countries. The health systems in Italy, Spain, Belgium, France, the Netherlands, and Iran displayed capacity difficulties. Despite the high numbers of COVID-19 cases in Germany, Austria, Switzerland, and the USA, the health systems in these countries are currently able to respond. Figure 7 leads us to the conclusion that quarantine conditions are not followed adequately in countries with a high number of new cases. Furthermore, it can be seen that the spread of the virus slowed down in period 1 countries, where the virus first spread, whereas the effects of the outbreak in period 2 countries will continue on the current course (Table 3).
However, it can also be observed that for countries in periods 6 and 7, the health systems that are struggling to cope with the numbers of COVID-19 patients are likely to see increased numbers of deaths. Various factors such as demographic structure, geographical structure, economic level, climatic conditions, and the measures taken can affect the pandemic outcomes of countries. A study by Violini [15], which compared 23 countries, also emphasized the importance of exposure times. For this reason, the duration of exposure to infection was taken into account in this study, as it can contribute to differences between countries. The WHO guidelines explained that the pre-epidemic preparations of countries and physician knowledge and skills also affected the rate of positive cases [16]. Secondly, the similarities of countries in terms of outbreak indicators were examined by a multivariate method. Figure 14 summarizes the similarities and differences between the countries studied at the end of March 2020 in terms of the total number of deaths, the total number of patients who recovered, and the total number of active cases. Those with characteristics different from the characteristics of other countries in terms of the effects of the pandemic are generally located in separate clusters. This study determined that the total number of deaths is higher in central and southern European countries, especially Italy, Spain, Switzerland, and Portugal. However, the number of patients who recovered in these countries is also high. Additionally, it was found that the number of active cases is higher in South America, East Asia, and northern European countries such as Italy, Spain, and Switzerland. According to the results of the cluster analysis, countries can make better decisions about the measures to be taken by investigating the reasons for the intra-cluster and inter-cluster differences found.

Visualization of fuzzy clustering results by Sammon mapping.

The clustering of countries according to various indicators is discussed in some studies [17,18]. In the k-means cluster analysis conducted by Zoumpekas [17], the total number of cases by country, the daily number of deaths, and the daily number of patients who recovered were considered. For each indicator, data presented in separate time series were used. Kumar [18] performed a hierarchical cluster analysis to classify Indian states and union territories on the basis of COVID-19 status. The analysis grouped 27 states and five union territories into six clusters. He concluded that optimization of monitoring techniques is required to improve government policies and decisions, medical facilities, treatment, etc. to reduce the number of people who die. Ploner [19] performed two different HDBSCAN cluster analyses. The first included only three features and worked well for countries having only 2.5 weeks of data after the outbreak. In comparison, the second analysis used features from the peak of the curve. For countries with increased numbers of daily cases, the peak moved and, therefore, the results changed. Approximately 60 countries were considered. Ploner [19] found higher mortality in Spain, Italy, Belgium, New York, Germany, and Canada than in other countries. Zarikas et al. [20] presented a novel analysis resulting in the clustering of countries according to active cases, active cases per population, and active cases per population and per area based on Johns Hopkins epidemiological data. They found that after removing Monaco and San Marino, a cluster including Liechtenstein and Andorra and one with Malta and Luxembourg were obtained, while all other countries remained together.

Conclusion

To define and track the progress of the pandemic and its effects, similarities between countries can be examined by considering indicators together. Therefore, better decisions can be made using multivariate analysis techniques such as cluster analysis [21], which is an extremely useful method for finding new relationships and insights [19]. In the event that the pandemic continues, this work offers a basic study that evaluates the measures taken by countries in the periods following outbreaks. In addition, the results of this study will benefit researchers by offering a guide for how to design more comprehensive research. It can be misleading to compare countries one by one in terms of each indicator. In this study, country similarities were investigated by considering the relationships between outbreak indicators. In conclusion, various features of countries, such as climatic conditions, cultural habits, average age, chronic disease frequency, the epidemic measures taken, and epidemic indicator results, can be related to each other. For this reason, studies that compare countries are recommended to analyse data with multivariate models such as cluster analysis, which take into account the relationships between these features.

Limitations

By the end of March 2020, only 34 of the 169 countries, 17 regions, and 2 ships struggling with the pandemic had reported the total number of tests. The results therefore give only limited information on the relationship between outbreak indicators and the total number of tests. In addition, only three outbreak indicators were used in this study to cluster countries according to their similarities. More accurate predictions could be made if the similarities of countries were investigated using the outbreak indicators together with many other features, such as pandemic measures, economic levels, climatic conditions, and demographic structures.

Table 5

Internal validity criteria and performance of the results

Table 6

Cluster information

Table 7

Clusters obtained from the fuzzy c-means algorithm

Table 8

Median values of the indicators in each cluster
