Alfonso Monaco1, Ester Pantaleo1,2,3, Nicola Amoroso1,4, Loredana Bellantuono1,2, Alessandro Stella5, Roberto Bellotti1,3.
Abstract
The identification of factors associated with COVID-19 mortality is important to design effective containment measures and safeguard at-risk categories. In the last year, several investigations have tried to ascertain key features to predict the COVID-19 mortality tolls in relation to country-specific dynamics and population structure. Most studies focused on the first wave of the COVID-19 pandemic, observed in the first half of 2020. Numerous studies have reported significant associations between COVID-19 mortality and relevant variables, for instance obesity, healthcare system indicators such as hospital bed density, and bacillus Calmette-Guérin immunization. In this work, we investigated the role of ABO/Rh blood groups at three different stages of the pandemic while accounting for demographic, economic, and health-system-related confounding factors. Using a machine learning approach, we found that the "B+" blood group frequency is an important factor at all stages of the pandemic, confirming previous findings that blood groups are linked to COVID-19 severity and fatal outcome.
Year: 2021 PMID: 34972836 PMCID: PMC8720090 DOI: 10.1038/s41598-021-04162-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
List of input features.
| Genetic features | Demographic indicators | Medical indicators | Economic indicators | Life style indicators |
|---|---|---|---|---|
| O+, A+, B+, AB+, O−, A−, B−, AB−, Rh−/Rh+ | Population density | Life expectancy at birth | GDP per capita | Percentage of female smokers |
| | Median age of the population | Cardiovascular death rate | Total healthcare expenditure | Percentage of male smokers |
| | Population aged 65 or older | Diabetes prevalence | Hospital beds per thousand inhabitants | |
| | Population aged 70 or older | | | |
We used five kinds of features: genetic, demographic, economic, medical, and life style indicators.
List of input countries.
| Countries | ||
|---|---|---|
| Austria | Belgium | Bosnia and Herzegovina |
| Bulgaria | Cyprus | Croatia |
| Czechia | Denmark | Estonia |
| Finland | France | Germany |
| Greece | Hungary | Iceland |
| Ireland | Italy | Lithuania |
| Luxembourg | Malta | Moldova |
| Montenegro | Netherlands | Norway |
| Poland | Portugal | Romania |
| Russia | Serbia | Slovakia |
| Slovenia | Spain | Sweden |
| Ukraine | United Kingdom | Armenia |
| Bangladesh | Bahrain | China |
| India | Indonesia | Iran |
| Israel | Japan | Lebanon |
| Myanmar | Malaysia | Nepal |
| Philippines | Singapore | Saudi Arabia |
| South Korea | Thailand | Turkey |
| United Arab Emirates | Yemen | Ethiopia |
| Ghana | Kenya | Mauritius |
| Morocco | South Africa | Zimbabwe |
| Canada | Costa Rica | Dominican Republic |
| Jamaica | Mexico | United States |
| Brazil | Chile | Colombia |
| Ecuador | New Zealand | Australia |
Figure 1. Flowchart of the proposed methodology. We fed the learning algorithm with selected features to forecast the TDPM. Microsoft PowerPoint was used to generate the figure.
Figure 2. Correlation matrix of all variables (both dependent and independent). As expected, “Median age of the population”, “Population aged 65 or older”, and “Population aged 70 or older” have mutual correlation close to 1; the genetic features are also highly correlated with each other. The TDPMs at the three different dates are also highly correlated with each other, as expected. Notably, the “B+” predictor has the highest (and negative) linear correlation with the TDPM at all three dates. R package corrplot 0.90 was used to generate the figure.
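A correlation matrix of this kind can be sketched in a few lines. The example below is only an illustration, assuming the Pearson coefficient is the linear correlation meant in the caption; the column names and values are made up and are not data from the study.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical toy values, NOT taken from the paper.
toy = {
    "median_age": [30.0, 35.0, 40.0, 45.0],
    "aged_65_or_older": [8.0, 11.0, 14.0, 17.0],
    "b_plus_frequency": [30.0, 22.0, 15.0, 9.0],
}
corr = correlation_matrix(toy)
for (a, b), r in corr.items():
    print(f"{a:>18} vs {b:<18} r = {r:+.2f}")
```

In this toy set the two age columns are exact linear functions of each other, so their coefficient is 1, which mirrors the near-1 correlations among the age indicators described in the caption.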
Figure 3. Boxplot of the distribution of the Boruta importance measure for input variables with median higher than variable “Shadow Max”. The distribution was obtained from 500 runs of the algorithm on the complete set of features using June, September, and December 2020 TDPM data. Using as cut-off the upper quartile of “Shadow Max”, we colored in yellow excluded variables and in green variables selected for further analysis. R base package graphics 4.0.5 was used to generate the figure.
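The “Shadow Max” comparison can be illustrated with a heavily simplified sketch. Boruta builds shuffled copies ("shadows") of the features and keeps a real feature only if its importance beats the best shadow. The code below is not the Boruta algorithm itself: as a loudly stated assumption, it replaces the random forest importance with the absolute Pearson correlation to the target and performs a single comparison rather than the repeated statistical test. All names and values are hypothetical.

```python
import math
import random

def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no information, so score it as 0.
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def shadow_max_hits(features, target, seed):
    """Return the features whose score exceeds the best shuffled (shadow) score."""
    rng = random.Random(seed)
    shadow_scores = []
    for values in features.values():
        shuffled = list(values)
        rng.shuffle(shuffled)  # shuffling breaks any link with the target
        shadow_scores.append(abs_corr(shuffled, target))
    shadow_max = max(shadow_scores)
    return {name for name, values in features.items()
            if abs_corr(values, target) > shadow_max}

# Hypothetical toy data, NOT taken from the paper.
target = [120.0, 95.0, 80.0, 60.0, 45.0, 30.0, 20.0, 10.0]
features = {
    "signal": [2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0, 14.0],
    "constant": [1.0] * 8,
}
print(shadow_max_hits(features, target, seed=0))
```

Because the shuffling is random, the outcome depends on the seed, which is why repeated runs and a tally of selections are needed to obtain a stable picture.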
Given the random nature of the Boruta algorithm, we performed 100 runs of this algorithm on the same dataset with different seeds, then counted how many times each feature was selected by Boruta and reported the result as a percentage in this table. Column “Type” has value “g” or “n” for “genetic” and “non genetic” features, respectively.
| Name | Type | Selected in June (%) | Selected in September (%) | Selected in December (%) |
|---|---|---|---|---|
| B+ | g | 100 | 100 | 100 |
| Diabetes prevalence | n | 100 | 0 | 0 |
| Cardiovascular death rate | n | 98 | 20 | 0 |
| O− | g | 93 | 4 | 100 |
| AB+ | g | 91 | 100 | 0 |
| Rh−/Rh+ | g | 42 | 1 | 100 |
| A− | g | 25 | 49 | 100 |
| Total healthcare expenditure | n | 4 | 0 | 0 |
| O+ | g | 0 | 2 | 0 |
| Percentage of female smokers | n | 0 | 0 | 100 |
| Population density | n | 0 | 0 | 100 |
| B− | g | 0 | 0 | 75 |
| A+ | g | 0 | 0 | 1 |
| GDP per capita | n | 0 | 0 | 0 |
| Hospital beds per thousand | n | 0 | 0 | 0 |
| Life expectancy at birth | n | 0 | 0 | 0 |
| Median age | n | 0 | 0 | 0 |
| Aged 65 or older | n | 0 | 0 | 0 |
| Aged 70 or older | n | 0 | 0 | 0 |
| Percentage of male smokers | n | 0 | 0 | 0 |
| AB− | g | 0 | 0 | 0 |
| | g | 0 | 0 | 0 |
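The tallying step behind this table can be sketched as follows. This is a minimal illustration that assumes each run returns the set of selected feature names; the function name `selection_percentages` and the four toy runs are hypothetical and do not reproduce the study's results.

```python
from collections import Counter

def selection_percentages(runs, feature_names):
    """Percentage of runs in which each feature was selected.

    `runs` is a list of sets, one set of selected feature names per run.
    """
    counts = Counter(name for selected in runs for name in selected)
    return {name: 100.0 * counts[name] / len(runs) for name in feature_names}

# Hypothetical outcome of four seeded runs, NOT the paper's results.
runs = [{"B+", "O-"}, {"B+"}, {"B+", "O-"}, {"B+"}]
percentages = selection_percentages(runs, ["B+", "O-", "AB-"])
for name, pct in percentages.items():
    print(f"{name}: {pct:.0f}%")
```

Features that are never selected still appear with 0%, which matches the way the table lists every input feature regardless of outcome.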
Performance measures of the RF regression model at each selected time point, using the selected Boruta features and averaged over 5 runs of cross validation (with the respective standard deviations).
| Time point | | RMSE | MAE |
|---|---|---|---|
| June | 0.47 ± 0.13 | 135 ± 10 | 85 ± 11 |
| September | 0.25 ± 0.19 | 192 ± 37 | 129 ± 24 |
| December | 0.34 ± 0.04 | 312 ± 48 | 241 ± 39 |
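The error measures and the "mean ± standard deviation" summary in this table can be sketched as below. This is an illustration under stated assumptions: the sample standard deviation is used (the caption does not say which estimator was applied), and the five per-run scores are made up rather than taken from the study.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def summarize(scores):
    """Return (mean, sample standard deviation) across cross-validation runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical RMSE values from five cross-validation runs, NOT the paper's.
run_rmse = [130.0, 142.0, 128.0, 150.0, 125.0]
mean_rmse, std_rmse = summarize(run_rmse)
print(f"RMSE = {mean_rmse:.0f} ± {std_rmse:.0f}")
```

RMSE penalizes large errors more strongly than MAE, so RMSE is never smaller than MAE on the same predictions, as is also the case in every row of the table.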
Figure 4. Average importance of the variables used in the RF model over 100 runs of the RF algorithm, with the respective standard deviations. R base package graphics 4.0.5 was used to generate the figure.
Performance metrics of a linear multivariate model applied to the set of features selected by Boruta at each time point, using all countries, averaged over 5 runs of cross validation (with the respective standard deviations). The last column reports only significant features. Significance codes: ‘***’ 0.001, ‘**’ 0.01, and ‘*’ 0.05. The multivariate linear model found feature “B+” to be significant at all three time points, and found “Cardiovascular death rate” to be significant only in June. The significance of these features is higher in June and lower but similar in September and December; however, most of the linearity is explained by the intercept of the linear model.
| Time point | | RMSE | MAE | Significant features |
|---|---|---|---|---|
| June | 0.31 ± 0.10*** | 138 ± 49 | 105 ± 36 | B+ **, Cardiovascular death rate ** |
| September | 0.32 ± 0.15*** | 184 ± 36 | 149 ± 34 | B+ ** |
| December | 0.29 ± 0.17*** | 329 ± 64 | 260 ± 44 | B+ * |
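The significance codes used in this table can be expressed as a small lookup. The sketch below assumes that each threshold is inclusive (p ≤ 0.001 gives ‘***’, and so on); the caption lists the thresholds but not how boundary values are treated, and the function name is hypothetical.

```python
def significance_code(p_value):
    """Map a p-value to the star notation listed in the table caption."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""  # not significant at the 0.05 level

# Hypothetical p-values for illustration only.
for p in (0.0005, 0.005, 0.03, 0.2):
    print(p, repr(significance_code(p)))
```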
Figure 5. Map of the TDPM in June, September, and December 2020 on the left. Maps of some of the input features on the right. Countries not included in the analysis are colored in gray. R package rworldmap 1.36 was used to generate the maps.