Christopher Toh, James P Brody.
Abstract
INTRODUCTION: The course of COVID-19 varies from asymptomatic to severe in patients. The basis for this range in symptoms is unknown. One possibility is that genetic variation is partly responsible for the highly variable response. We evaluated how well a genetic risk score based on chromosomal-scale length variation and machine learning classification algorithms could predict severity of response to SARS-CoV-2 infection.
Keywords: COVID-19; Genetic risk score; Machine learning; UK biobank
Year: 2020 PMID: 33036646 PMCID: PMC7546598 DOI: 10.1186/s40246-020-00288-y
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
We segmented the dataset into three overlapping subsets. The first, which we called “1930,” contained all UK Biobank participants born after 1930 who had a severe reaction to SARS-CoV-2 infection before 27 April 2020. The other two subsets, “1940” and “1950,” contained the corresponding participants born after 1940 and after 1950, respectively.
| Dataset | Number |
|---|---|
| 1930 (< 90 years of age) | 981 |
| 1940 (< 80 years of age) | 880 |
| 1950 (< 70 years of age) | 468 |
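The overlapping segmentation can be illustrated with a minimal sketch. The records below are hypothetical placeholders, not UK Biobank data, and treating "born after" as a strict inequality is an assumption; the source does not state how the boundary year was handled.

```python
# Hypothetical records: (participant_id, birth_year). Not real UK Biobank data.
cases = [("a", 1933), ("b", 1945), ("c", 1948), ("d", 1956), ("e", 1962)]


def subset_born_after(records, year):
    """Keep records whose birth year is strictly later than `year` (assumed cutoff rule)."""
    return [r for r in records if r[1] > year]


# Build the three subsets; each later cutoff yields a subset of the earlier one.
subsets = {year: subset_born_after(cases, year) for year in (1930, 1940, 1950)}

for year, members in subsets.items():
    print(year, len(members))
```

Because every participant born after 1950 was also born after 1940 and after 1930, the subsets are nested, which is why the counts in the table decrease from the 1930 to the 1950 dataset.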
Fig. 1 This boxplot presents the results of the machine learning predictions. We created three datasets: the first includes all patients less than 90 years old, the second all patients less than 80 years old, and the third all patients less than 70 years old. These are labeled by oldest birth year, “data.” Each dataset contained equal numbers of patients with a “severe reaction” to COVID-19 and age-matched people drawn from the general UK Biobank population (“normal”). For comparison, we randomly permuted the status (“severe reaction” or “normal”) in each of the three datasets and repeated the process. These randomly permuted datasets are labeled by oldest birth year, “random.” For each dataset, we repeated the whole process 100 times, each time with a different set of age-matched people from the general UK Biobank population.
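The role of the permuted control can be sketched in a few lines. This is a simplified illustration on synthetic data, not the authors' pipeline: it permutes labels against a fixed set of scores rather than retraining an XGBoost model on each permuted dataset, and the function names `auc` and `permuted_auc` are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permuted_auc(labels, scores, rng):
    """AUC after randomly shuffling the labels, which breaks any label-score link."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return auc(shuffled, scores)


rng = random.Random(0)

# Synthetic, balanced data with weakly informative scores (an assumption for illustration).
labels = [i % 2 for i in range(200)]
scores = [rng.gauss(0.1 * y, 1.0) for y in labels]

observed = auc(labels, scores)
null_aucs = [permuted_auc(labels, scores, rng) for _ in range(100)]
null_mean = sum(null_aucs) / len(null_aucs)

print(f"observed AUC: {observed:.3f}")
print(f"mean AUC over 100 permutations: {null_mean:.3f}")
```

With permuted labels, the AUC values scatter around 0.5, which is the behavior the “random” datasets are meant to show.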
We compared the difference in mean AUC values between the various datasets using a t test. The datasets consisting of people born after 1930, 1940, and 1950 all showed significant differences from the corresponding random control. The mean AUC of each of those three datasets also differed significantly from 0.5. The three random controls did not show a significant difference between the mean AUC and 0.5, as expected. An AUC value of 0.5 represents random classification, in which the algorithm is no better than guessing.
| Group 1 | Group 2 | p value |
|---|---|---|
| 1930 data | 1930 random | 2 × 10^-11 |
| 1940 data | 1940 random | 1 × 10^-9 |
| 1950 data | 1950 random | 1 × 10^-4 |
| 0.5 | 1930 data | 3 × 10^-14 |
| 0.5 | 1940 data | 4 × 10^-13 |
| 0.5 | 1950 data | 3 × 10^-4 |
| 0.5 | 1930 random | 0.1 |
| 0.5 | 1940 random | 0.4 |
| 0.5 | 1950 random | 0.08 |
The mean and standard deviation of the area under the receiver operating characteristic curve (AUC) were computed over the 100 XGBoost classification models trained for each dataset. Each run used a different set of people who did not have a severe reaction to COVID-19. The AUC values for all three datasets were well described by a normal distribution, as confirmed by a Shapiro-Wilk normality test.
| Dataset | Mean AUC | SD AUC |
|---|---|---|
| 1930 data | 0.515 | 0.017 |
| 1940 data | 0.516 | 0.019 |
| 1950 data | 0.511 | 0.030 |
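As a rough consistency check, the comparison of each mean AUC with 0.5 can be reproduced from the summary statistics alone. The sketch below assumes a one-sample t test over the 100 runs; the source says only that a t test was used, so this form is an assumption, and `one_sample_t` is a hypothetical helper name.

```python
import math

# Mean and standard deviation of the AUC for each dataset, taken from the table above.
summary = {
    "1930 data": (0.515, 0.017),
    "1940 data": (0.516, 0.019),
    "1950 data": (0.511, 0.030),
}
N_RUNS = 100  # number of repeated runs per dataset, as stated in the text


def one_sample_t(mean, sd, n, mu0=0.5):
    """t statistic for testing whether a sample mean differs from mu0."""
    return (mean - mu0) / (sd / math.sqrt(n))


t_stats = {name: one_sample_t(m, s, N_RUNS) for name, (m, s) in summary.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f} (df = {N_RUNS - 1})")
```

Under these assumptions the t statistics are large for the 1930 and 1940 datasets and smaller for the 1950 dataset, which matches the ordering of the reported p values: the effect is small in absolute terms (mean AUC about 0.51) but consistent across runs.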