Poking COVID-19: Insights on Genomic Constraints among Immune-Related Genes between Qatari and Italian Populations.

Hamdi Mbarek¹, Massimiliano Cocca², Yasser Al-Sarraj¹, Chadi Saad¹, Massimo Mezzavilla², Wadha AlMuftah¹, Dario Cocciadiferro³, Antonio Novelli³, Isabella Quinti⁴, Azza AlTawashi⁵, Salvino Salvaggio⁵, Asma AlThani¹, Giuseppe Novelli⁶, Said I Ismail¹.

Abstract

Host genomic information, specifically genomic variations, may characterize susceptibility to disease and identify people with a higher risk of harm, leading to better targeting of care and vaccination. Italy was the epicentre for the spread of COVID-19 in Europe, the first country to go into a national lockdown, and has one of the highest COVID-19-associated mortality rates. Qatar, on the other hand, has a very low mortality rate. In this study, we compared whole-genome sequencing data from 14,398 adult Qatari nationals with data from 925 Italian individuals. We also included in the comparison whole-exome sequence data from 189 Italian laboratory-confirmed COVID-19 cases. We focused our study on a curated list of 3619 candidate genes involved in innate immunity and host-pathogen interaction. Two population-gene metric scores, the Delta Singleton-Cohort variant score (DSC) and the Sum Singleton-Cohort variant score (SSC), were applied to estimate the presence of selective constraints in the Qatari population and in the Italian cohorts. Results based on the DSC and SSC metrics demonstrated a different selective pressure on three genes (MUC5AC, ABCA7, FLNA) between the Qatari and Italian populations. This study highlighted the genetic differences between Qatari and Italian populations and identified a subset of genes involved in innate immunity and host-pathogen interaction.

Keywords: COVID-19; COVID-19 severity; genetic constraints; population genetics

Year: 2021 PMID: 34828448 PMCID: PMC8623290 DOI: 10.3390/genes12111842

Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096

1. Introduction

COVID-19 continues to spread worldwide, with over four million deaths to date and rising. However, this global spread is coupled with stark anomalies in morbidity and mortality. These differences can be seen not only between different populations but also within the same population [1,2,3,4]. While most of these differences can be attributed to sociodemographic and clinical factors, this is also a unique opportunity to assess associations with host genomes. Host genomic information, specifically genomic variations, may characterize susceptibility to disease and identify people with a higher risk of harm, leading to better targeting of care and vaccination [5,6,7]. In addition, characterizing these host factors may help in identifying and developing adapted drugs and vaccines [8,9,10]. The scientific community came together with several efforts to investigate how genomic variation in the host affects disease susceptibility and progression [11,12]. So far, these large consortia efforts have led to the identification of over 20 loci associated with susceptibility or severity of the disease [13]. Italy was the epicentre for the spread of COVID-19 in Europe and the first country to go into a national lockdown. It had one of the highest COVID-19-associated mortality rates in Europe [4]. At the time of writing, almost five million cases have been confirmed, with a death toll of more than 130 thousand people (infection fatality rate = 2.7%). Qatar, on the other hand, despite having one of the highest worldwide numbers of laboratory-confirmed cases (36,729 cases per million, by July 2020), has a very low mortality rate (infection fatality rate = 0.91 per 10,000 persons, by July 2020, per WHO COVID-19 mortality classification) [14]. Studies have even suggested that some communities in Qatar have reached herd immunity for SARS-CoV-2 at a proportion of infection of 65–70% [15].
With the development of the first generation of RNA-based vaccines [16,17], along with the more standard adenovirus-based solutions [18,19], the end of the pandemic seems to be in sight, though we are aware that this is just the beginning of a more bearable coexistence with the virus. Differences in fatality rate or disease prevalence between population groups, besides the socio-economic factors, could also be attributed to the patients’ genetic background, and, as mentioned above, several studies are investigating the host genetic contribution to disease susceptibility and severity [11,12,13]. In this study, we focused on genes involved in the immune response, combining them with a dataset of 1500 proteins mostly involved in COVID-19 disease [20] and a subset of genes already identified as linked to COVID-19 susceptibility and progression [7]. We applied, on this set of genes, a prioritization method based on ultra-rare and population-specific variants. With this approach, we aim to identify a group of genes showing different signs of selective pressure in our study cohorts. Our hypothesis is that those genes can provide information to understand the pandemic progression and may help guide therapy.

2. Materials and Methods

2.1. Population Description

The Qatari Cohort: The Qatar Genome Program (QGP) [21] is a population-based project launched by the Qatar Foundation to generate a large-scale whole-genome sequence (WGS) dataset, in combination with comprehensive phenotypic information collected by the Qatar Biobank (QBB) [22]. All subjects included in the analysis were of Qatari Middle Eastern Arabian ancestry [23]. In this study, we use a cohort of 14,398 individuals with an average coverage of 30X. Data preprocessing and downstream quality control analyses for WGS data were conducted as recommended by the COVID-19 Host Genetics Initiative study protocol [7].

Italian Genetic Isolated cohorts: Three Italian cohorts belonging to the Italian Network of Genetic Isolates (INGI) were involved in this study due to the availability of whole-genome sequence data. The selected populations are located in three different geographical areas of Italy: North-West (Val Borbera, VBI), North-East (Friuli Venezia Giulia, FVG) and South-East (Carlantino, CAR). In each cohort, a wide range of phenotypic data is available for each participant (e.g., anthropometric traits, blood tests, sensory impairment, taste and food preferences, extensive personal and familial anamnesis). A total of 925 samples with low-coverage (4X to 10X) WGS data were selected for the analyses [24].

Italian COVID-19 positive samples: A cohort of 189 individuals who tested positive for SARS-CoV-2 infection, collected at the Bambino Gesù hospital in Rome, was included in the study to provide information on the pattern of genetic variation in a group of selected genes in an outbred Italian cohort. Whole-exome sequencing data was generated by the University of Tor Vergata from peripheral blood. The samples are clustered in three groups based on disease severity: severe, extremely severe and asymptomatic [25].

All data analyzed was aligned to the GRCh38 release of the reference genome, and functional annotations were obtained using the Ensembl VEP tool [26].

2.2. Principal Component Analysis

To highlight the level of population structure in the study cohorts, we performed a principal component analysis (PCA) using KING software [27]. Plink v1.9 software [28] was used to convert data from VCF to plink binary format. Results for QGP and each INGI cohort were projected onto the 1000Genomes Project data [29]. To highlight the peculiar ancestry structure of the Qatari population, we also performed an ancestry inference analysis using the software KING.

2.3. Genes Selection and Prioritization Analyses

Literature curation process, Genomics England (GEL) panel expert and Ingenuity Variant Analysis (IVA): Candidate gene generation, initial ranking and curation were based on a review of the recent literature, from which we extracted a list of genes involved in innate immunity and host-pathogen interaction. The primary gene list is curated according to the knowledge-literature base by the Ingenuity® Variant Analysis™ software from QIAGEN [30] and the viral gene panel expert from Genomics England (GEL) [31]. This list includes a total of 3617 genes (Table S1). Candidate genes were annotated with the most common gene-ranking metrics using the loss of function intolerance score (pLI) [32] and the Residual Variation Intolerance Score (RVIS) [33]. In addition, we selected a list of 25 genes (Table S2) that were recently associated with COVID-19 susceptibility and severity [7] and overlapped with the primary list, extracting a subset of genes that underwent further analyses.

Population-based gene constraints: Two population-based gene metric scores were adopted to estimate the presence of specific selection pressures in the Qatari population as well as in the Italian isolated cohorts [34]: the Delta Singleton-Cohort variant score (DSC), which accounts for the difference in singletons between coding and non-coding regions, and the Sum Singleton-Cohort variant score (SSC), which accounts for the sum of singleton variants in the coding and non-coding regions. Only variants with a QUAL value above 30 were used in the calculation, to limit the inclusion of genotyping errors for variants with allele count (AC) equal to 1. Only scores calculated on canonical transcripts were selected. Only genes with score values lower than or equal to −2 or greater than or equal to 2 were retained in each population. These values represent the significance thresholds that allow us to discriminate between a gene under constraint (DSC or SSC score ≤ −2) and a gene under relaxation (DSC or SSC score ≥ 2).
Since we aim to compare two populations with different structures and high levels of inbreeding, we also calculated the same set of scores for the closest ancestry populations of each study cohort. We used data from the gnomAD v3.1 [35] call set, including 1000Genomes project samples and extracted information on the EUR, AFR and SAS superpopulations subset. The EUR subset was used as reference for the Italian samples [36], while the AFR and SAS subsets for the Qatari cohort [23]. We defined two different levels of comparison: at the ancestry level, in which we selected all genes showing concordant selective signals between our study cohorts and their closest ancestry population, and at the population level, in which we selected only genes showing different behaviour between our target population and the reference. Since we were dealing with three Italian populations, comparing only with one reference, we selected genes that satisfied our criteria in at least one of the three target Italian populations. On the other hand, we used two different references for the Qatari target population, so we selected all genes that meet our criteria in at least one reference population. Finally, we proceeded with the comparisons between our target populations, performing three sets of comparisons: population-specific, population-specific vs. ancestry related and ancestry related comparisons (Table S3). Each comparison was performed separately for SSC and DSC scores. Finally, we generated a list of genes overlapping between SSC and DSC comparisons to select those genes that consistently showed opposite behaviour in terms of selection or relaxation in our target populations. We used the Fisher’s test to compare DSC and SSC scores distributions between study cohorts, and reference populations. 
We performed a Shapiro-Wilk test to assess the normality of the score distribution in each cohort and an enrichment test to assess whether there was an enrichment in relaxed or constrained genes in our target populations vs. the selected reference populations.
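
The thresholding and the two levels of comparison described in this section can be illustrated with a minimal Python sketch. The function names (`classify`, `comparison_level`) are hypothetical and are not taken from the study pipeline. The sketch assumes that "concordant" simply means that target and reference receive the same class label, which is a simplification of the comparison scheme detailed in Table S3. The scores used in the example are the DSC values reported in Table 1 for FLNA and MUC5AC in the CAR cohort and in the EUR reference.

```python
# Thresholds stated in the text: score <= -2 means constraint, score >= 2 means relaxation.
CONSTRAINT_THRESHOLD = -2.0
RELAXATION_THRESHOLD = 2.0


def classify(score: float) -> str:
    """Label a DSC or SSC score as constrained, relaxed or neutral."""
    if score <= CONSTRAINT_THRESHOLD:
        return "constrained"
    if score >= RELAXATION_THRESHOLD:
        return "relaxed"
    return "neutral"


def comparison_level(target_score: float, reference_score: float) -> str:
    """Assumed simplification: same label as the reference means an
    ancestry-level signal, a different label means a population-level one.
    Genes that are neutral in the target cohort are not retained."""
    target = classify(target_score)
    reference = classify(reference_score)
    if target == "neutral":
        return "not retained"
    if target == reference:
        return "ancestry-level"
    return "population-level"


# DSC values from Table 1 (CAR cohort vs. EUR reference).
print("FLNA:", comparison_level(-2.435, -2.399))
print("MUC5AC:", comparison_level(-2.404, 3.477))
```

Under these assumptions FLNA in CAR follows its EUR reference, while MUC5AC in CAR departs from it, which agrees with the different comparison codes the two genes carry in Table 1.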

2.4. WES COVID-19 Cohorts

Using Whole Exome Sequence data from a cohort of COVID-19 positive samples (n = 189), we calculated singletons count and singletons density in the coding regions of the genes belonging to the shortlist generated, adjusting by sample size, and compared using Fisher’s test against data from the other target and reference populations. In this cohort, each sample was characterized by a disease severity code. The disease severity classes are defined as follows: (1) Asymptomatic/Paucisymptomatic, (2) Severe, (3) Critical/life-threatening [25]. We used this information to investigate if we could identify any contribution of the singleton burden of the prioritized genes to the classification. A multinomial analysis with R was performed using age, gender and the singleton count as explanatory variables. We also analyzed the contribution of the prioritized genes to the outcome (Survived/Deceased) with a logistic regression model and the same covariates used in the multinomial analysis. We performed the analyses using both the whole-gene singleton count and the coding regions singleton count. A summary of the phenotype information is available in Table S4.
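
As an illustration of the cohort comparison, the sketch below implements a two-sided Fisher’s exact test on a 2 × 2 table using only the Python standard library. The layout of the table (rows are two cohorts, columns are singleton sites and other variant sites) is an assumption, since the text does not specify how the counts were arranged or how the sample-size adjustment was applied. The function name `fisher_exact_two_sided` is hypothetical, and the counts in the example are invented placeholders, not study data.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    The p-value is the sum of the hypergeometric probabilities of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of observing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Placeholder counts (NOT study data): singleton sites vs. other variant
# sites in one gene for two hypothetical cohorts.
cohort_a = (12, 88)
cohort_b = (30, 70)
print(fisher_exact_two_sided(*cohort_a, *cohort_b))
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead; the explicit version is shown only to make the calculation visible.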

3. Results

3.1. Population Stratification

As expected, the PCA (Figure 1) showed a clear differentiation between QGP and INGI (European) ancestry. The Italian cohorts clustered with the European samples from the 1000Genomes Project reference data, while the QGP samples overlapped with clusters from different populations (AFR, SAS, AMR, EUR). Using the ancestry inference function provided by the KING software, we confirmed the presence of different sub-population clusters in the QGP cohort, highlighting that a considerable proportion of the analyzed samples (more than 4000 samples) belong to a ‘missing’ super population cluster (Figure S1). This is mainly due to the absence of populations from the Near East in the 1000Genomes Project data. This outcome confirms results already obtained by different studies on the Italian populations [36] and on the first subset of nearly 6000 samples of the Qatari population [37].

Figure 1

PCA plot of the QGP and INGI cohorts projected onto 1000Genomes Project data. As expected, the first two principal components already show the separation between the QGP and the INGI cohorts and the overlap with the selected populations for the ancestry-related comparisons.

3.2. Population Based Gene Prioritization

For each population and each gene in the selected subset, we calculated two scores related to the presence of cohort-singleton variants. Figure 2 shows the distribution, among the 3617 selected genes, of the DSC score (top panel) and the SSC score (bottom panel) in each study cohort and in the selected reference groups (EUR, AFR and SAS). We compared each target population score distribution with the relevant reference population (Table S5). All the INGI populations are significantly different from the reference EUR population for both scores. In contrast, the Qatari population significantly differs from the reference populations (AFR and SAS) in DSC score distribution, but not in SSC score distribution. This pattern is also confirmed in the enrichment analyses of relaxed and constrained genes in target populations vs. reference populations. Fisher’s exact tests show enrichment in constrained genes in all the target populations when comparing DSC scores (Table S6). When we consider the SSC scores, however, none of the target populations shows significant enrichment in constrained genes. Regarding the relaxed genes, though, we detected a significant enrichment in both DSC and SSC scores for the Italian populations (CAR, VBI and FVG) but not for the QGP cohort (Table S6).

Figure 2

Distributions of the prioritization scores. Violin plots of the distributions of DSC (top panel) and SSC (bottom panel) scores in the subset of selected genes for all target populations (CAR, FVG, VBI, QGP) and all reference outbred populations (AFR, EUR, SAS) from 1000Genomes project.

We used a threshold of −2 to define significant constraint and a threshold of +2 to define a significant relaxation signal [34]. Results for the comparisons between our target populations are summarized in Table 1 and Table 2. Regarding the DSC score, we identified six genes with a signature of constraint in the QGP population and relaxation in at least one of the Italian populations. Two of those genes (TTN and LRP1B) are results of population-specific comparisons, and one (RICTOR) is the outcome of an ancestry-related comparison. Eight genes showed an opposite pattern of relaxation in the Qatari cohort and constraint in at least one Italian cohort. Among them, RYR3 is the result of an ancestry related comparison. When comparing our target populations based on the overall burden of singletons in each gene (SSC score), we identified a total of 35 genes that behave differently between the Qatari population and at least one Italian population (Table 2). Seventeen of those genes showed a pattern of constraint in the Qatar population and relaxation in at least one Italian cohort. The HELZ gene was the only one arising from a population-specific comparison. The remaining eighteen genes showed a pattern of relaxation in the QGP dataset but a significant constraint in at least one of the other targets. In this subset, the CELSR2 gene is the result of a population-specific comparison. Since our focus is to identify genes that consistently show different selection signals among our cohorts, we selected a subset of genes for which both DSC and SSC scores are concordant: ABCA7, FLNA, MUC5AC (Table 3). Those three genes showed a consistent relaxation pattern in the QGP cohort while being always characterized by strong signals of constraints in at least one Italian cohort. Interestingly, FLNA shows a significant signal of constraint in CAR and VBI cohorts while remaining neutral in the FVG dataset. 
The trend for the constraint signal is also replicated in all the reference cohorts selected. Data from other outbred populations from the 1000Genomes Project (EAS and AMR) confirm the trend of constraint (Table S7). ABCA7 repeats the pattern observed for FLNA in terms of target populations, with a significant constraint signal in the FVG cohort and a trend of constraint in the CAR cohort, while being neutral in the VBI cohort. This time, though, the outbred reference populations and the remaining super populations of the 1000Genomes Project are all in agreement, showing relaxation signals. Lastly, the MUC5AC gene shows a consistent pattern of significant constraint in all the Italian cohorts but, conversely, a significantly relaxed pattern in all other populations.

Table 1

Results from comparison of DSC scores between target cohorts (CAR, FVG, VBI, QGP) and the relevant reference superpopulations from the 1000 Genomes Project (EUR, AFR, SAS). The last column refers to the nature of the comparison carried out, as detailed in Supplementary Table S3.

		DSC Score
Transcript ID	Gene Name	QGP	CAR	FVG	VBI	EUR	AFR	SAS	Comparison
ENST00000369850	FLNA	3.854	−2.435	0.272	−2.510	−2.399	−2.166	−1.879	C5
ENST00000350763	TNC	3.370	−3.792	1.388	−0.631	2.666	1.838	2.187	C4
ENST00000389048	ALK	2.575	3.651	0.290	−4.098	3.388	3.212	2.852	C4
ENST00000263094	ABCA7	2.566	−0.433	−2.168	0.071	3.020	2.681	2.150	C4
ENST00000647814	ABCC2	2.528	−3.466	0.467	3.004	2.562	2.877	2.508	C4
ENST00000621226	MUC5AC	2.435	−2.404	−2.017	−2.892	3.477	3.500	3.032	C4
ENST00000634891	RYR3	2.229	−3.377	−1.554	−3.586	−2.431	2.639	−3.449	C8
ENST00000542267	FBXL17	2.026	−1.232	0.180	−2.658	3.086	0.266	2.477	C4
ENST00000589042	TTN	−2.242	−2.595	3.411	4.584	−2.388	−1.965	3.498	C1
ENST00000357387	RICTOR	−2.369	−2.206	−0.033	2.407	2.181	1.284	−4.070	C7
ENST00000561890	MUC22	−2.472	−1.682	−1.632	2.156	−2.562	−2.191	−2.425	C3
ENST00000336596	EPHA3	−3.001	2.139	−3.421	−1.535	−3.704	1.921	−2.814	C3
ENST00000648947	INO80	−3.444	−1.309	2.424	−2.928	−3.439	−2.692	−0.727	C3
ENST00000389484	LRP1B	−4.888	−4.916	3.192	2.602	−4.245	−2.136	3.231	C1

Table 2

Results from comparison of SSC scores between target cohorts (CAR, FVG, VBI, QGP) and the relevant reference superpopulations from the 1000 Genomes Project (EUR, AFR, SAS). The last column refers to the nature of the comparison carried out, as detailed in Supplementary Table S3.

		SSC Score
Transcript ID	Gene Name	QGP	CAR	FVG	VBI	EUR	AFR	SAS	Comparison
ENST00000378473	PLCB4	−4.524	2.917	−3.907	−0.413	−4.272	−2.873	−3.079	C3
ENST00000366574	RYR2	−4.347	3.792	−4.694	−5.053	−3.704	−2.318	−2.087	C3
ENST00000315872	ROCK2	−3.680	3.822	−2.372	−1.149	−4.160	−3.248	−3.665	C3
ENST00000361445	MTOR	−3.371	−0.975	0.100	2.659	−3.378	−4.276	−3.888	C3
ENST00000358691	HELZ	−3.131	4.249	1.658	−3.729	−3.437	−2.929	3.413	C1
ENST00000355286	EYA4	−3.000	−1.900	2.671	−1.486	−2.087	−2.589	−0.756	C3
ENST00000381501	TEC	−2.996	−2.615	−1.656	3.085	−2.497	−2.427	−0.767	C3
ENST00000265382	PIP5K1B	−2.952	2.574	−0.576	−2.746	−3.246	−3.197	−1.583	C3
ENST00000359015	MAP3K5	−2.758	2.108	0.850	1.431	−3.555	−2.472	−3.293	C3
ENST00000335670	RORA	−2.586	−2.953	1.176	2.228	−2.499	−2.743	−0.526	C3
ENST00000370056	VAV3	−2.523	3.467	1.312	1.038	−3.282	−2.646	−1.406	C3
ENST00000381298	IL6ST	−2.522	−1.224	3.859	2.542	−2.120	−1.466	−2.644	C3
ENST00000432237	CD163	−2.506	−1.404	2.793	−2.306	−2.419	−0.629	−2.156	C3
ENST00000392552	GPR155	−2.338	−1.261	−1.336	2.338	−2.417	−1.608	−2.585	C3
ENST00000382292	SACS	−2.324	−4.408	3.917	2.284	−3.530	−2.726	−2.082	C3
ENST00000392132	XRCC5	−2.176	−2.147	2.722	−1.257	−2.673	−2.107	−1.787	C3
ENST00000313708	EBF1	−2.068	2.253	−1.422	−0.914	−2.980	−1.665	−3.222	C3
ENST00000400841	CRLF2	2.036	−1.347	−1.496	−2.083	2.581	2.185	1.052	C4
ENST00000369850	FLNA	2.058	−3.158	−0.860	−4.025	−3.073	−3.351	−3.097	C5
ENST00000344327	TRPC6	2.062	−3.776	−2.671	−2.711	−3.382	0.278	−2.242	C5
ENST00000263317	NOX4	2.225	−2.716	−2.554	−2.717	2.134	3.532	3.770	C4
ENST00000403662	CSF2RB	2.237	−2.363	1.702	0.620	2.613	0.319	2.782	C4
ENST00000297494	NOS3	2.243	1.436	−2.178	−0.851	2.109	2.460	2.455	C4
ENST00000295598	ATP1A1	2.258	−2.679	0.547	0.930	−2.204	−1.886	−2.448	C5
ENST00000085219	CD22	2.288	0.576	0.028	−2.311	2.368	−0.600	2.142	C4
ENST00000305877	BCR	2.397	−1.338	2.028	−3.021	3.994	2.923	3.631	C4
ENST00000333149	TRIM50	2.501	2.275	1.138	−2.022	2.271	1.372	3.197	C4
ENST00000271332	CELSR2	2.522	3.651	−2.581	2.455	2.443	−2.129	2.936	C2
ENST00000447648	TECPR1	2.666	−2.351	1.822	−1.669	2.777	3.213	−0.027	C4
ENST00000324856	ARID1A	3.434	−3.120	−2.683	−1.845	−2.085	−2.053	0.835	C5
ENST00000263094	ABCA7	3.796	−1.601	−2.581	1.004	2.591	3.325	3.998	C4
ENST00000372923	DNM1	3.941	−2.514	−1.222	−0.710	−2.066	−1.575	−2.077	C5
ENST00000621226	MUC5AC	3.965	−3.705	−3.751	−4.601	3.679	3.267	4.244	C4
ENST00000533211	SPTBN2	4.531	−2.266	1.209	−1.893	2.589	2.191	2.778	C4
ENST00000529681	MUC5B	4.744	3.085	1.483	−2.070	4.884	4.396	5.095	C4

Table 3

List of genes with a concordant signature of selection between DSC and SSC scores, after the comparison between target cohorts (CAR, FVG, VBI, QGP) and the relevant reference superpopulations from the 1000 Genomes Project (EUR, AFR, SAS).

		DSC Score							SSC Score
Transcript ID	Gene Name	QGP	CAR	FVG	VBI	EUR	AFR	SAS	QGP	CAR	FVG	VBI	EUR	AFR	SAS
ENST00000369850	FLNA	3.854	−2.435	0.272	−2.510	−2.399	−2.166	−1.879	2.058	−3.158	−0.860	−4.025	−3.073	−3.351	−3.097
ENST00000263094	ABCA7	2.566	−0.433	−2.168	0.071	3.020	2.681	2.150	3.796	−1.601	−2.581	1.004	2.591	3.325	3.998
ENST00000621226	MUC5AC	2.435	−2.404	−2.017	−2.892	3.477	3.500	3.032	3.965	−3.705	−3.751	−4.601	3.679	3.267	4.244
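
The concordance reported in Table 3 can be re-checked with a short Python sketch that uses the target-cohort values from the table. The helper names are hypothetical, and the criterion encoded here is an assumption that is slightly stricter than the wording of the text: QGP must be relaxed (score ≥ 2) for both DSC and SSC, and at least one Italian cohort must be constrained (score ≤ −2) for both scores.

```python
COHORTS = ("QGP", "CAR", "FVG", "VBI")
ITALIAN_COHORTS = ("CAR", "FVG", "VBI")

# Target-cohort scores copied from Table 3, in the order given by COHORTS.
TABLE3 = {
    "FLNA": {
        "DSC": (3.854, -2.435, 0.272, -2.510),
        "SSC": (2.058, -3.158, -0.860, -4.025),
    },
    "ABCA7": {
        "DSC": (2.566, -0.433, -2.168, 0.071),
        "SSC": (3.796, -1.601, -2.581, 1.004),
    },
    "MUC5AC": {
        "DSC": (2.435, -2.404, -2.017, -2.892),
        "SSC": (3.965, -3.705, -3.751, -4.601),
    },
}


def constrained_italian_cohorts(scores):
    """Return the Italian cohorts whose score is at or below -2."""
    by_cohort = dict(zip(COHORTS, scores))
    return {name for name in ITALIAN_COHORTS if by_cohort[name] <= -2.0}


def is_concordant(gene: str) -> bool:
    """Assumed criterion: QGP relaxed in both scores and at least one
    Italian cohort constrained in both scores."""
    dsc, ssc = TABLE3[gene]["DSC"], TABLE3[gene]["SSC"]
    qgp_relaxed = dsc[0] >= 2.0 and ssc[0] >= 2.0
    shared = constrained_italian_cohorts(dsc) & constrained_italian_cohorts(ssc)
    return qgp_relaxed and bool(shared)


for gene in TABLE3:
    shared = constrained_italian_cohorts(TABLE3[gene]["DSC"]) & \
        constrained_italian_cohorts(TABLE3[gene]["SSC"])
    print(gene, is_concordant(gene), sorted(shared))
```

With the Table 3 values, the cohorts constrained in both scores are CAR and VBI for FLNA, FVG for ABCA7, and all three Italian cohorts for MUC5AC.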

3.3. COVID-19 Cohort Analysis

Next, we included a cohort of 189 COVID-19 positive samples (TOV cohort) and calculated the number of singleton variants in this subset for the three genes of interest. Table 4 shows the results of the comparisons with the other study populations and the reference populations. If we consider the whole gene, we can see how, for the FLNA gene, the TOV cohort shows a small difference in the singleton density when compared to the FVG cohort and a more significant difference with the VBI and QGP cohorts, while ABCA7 and MUC5AC genes have consistently a significantly different pattern when compared with all reference and target populations (Table S8). If we consider only the coding part of each gene, we confirm the minor differences in the FLNA gene between the TOV cohort and the VBI and QGP cohorts. We also confirm the results for ABCA7 and MUC5AC (Table S9).

Table 4

Comparison of Singleton burden between the COVID-19 positive cohort (TOV) and other target and reference populations. The reported p-values refer to the comparison between whole gene singleton burden (“p-value whole gene” column) and coding regions singletons burden (“p-value CDS region” column). All singleton counts have been adjusted considering the sample size of each cohort.

Transcript ID	Gene Name	Cohort	p-Value Whole Gene	p-Value CDS Regions
ENST00000369850	FLNA	CAR	0.630140	0.409653
		FVG	0.046901	0.316565
		VBI	0.000458	0.013015
		QGP	0.000028	0.039803
		EUR	0.312323	0.786342
		AFR	0.878006	0.787767
		SAS	0.200408	0.813561
ENST00000263094	ABCA7	CAR	3.2746 × 10⁻¹¹	2.5959 × 10⁻⁷
		FVG	3.1278 × 10⁻²³	1.0535 × 10⁻¹⁷
		VBI	7.1607 × 10⁻²¹	2.0060 × 10⁻¹⁶
		QGP	6.2413 × 10⁻⁶³	1.5606 × 10⁻⁴⁰
		EUR	4.4467 × 10⁻¹⁰	1.0966 × 10⁻⁸
		AFR	1.7360 × 10⁻⁸	5.3713 × 10⁻⁹
		SAS	2.3435 × 10⁻⁴	3.7924 × 10⁻⁶
ENST00000621226	MUC5AC	CAR	7.4274 × 10⁻¹²	8.4692 × 10⁻¹¹
		FVG	1.8836 × 10⁻³¹	2.7623 × 10⁻²⁴
		VBI	2.3148 × 10⁻³⁶	3.2118 × 10⁻³⁰
		QGP	1.5512 × 10⁻⁷⁶	1.2101 × 10⁻⁶⁴
		EUR	3.3142 × 10⁻⁶	4.0071 × 10⁻⁸
		AFR	2.9701 × 10⁻⁸	1.0241 × 10⁻¹⁰
		SAS	7.2316 × 10⁻³	9.9471 × 10⁻⁵

To investigate whether the burden of singletons in the prioritized genes could contribute to the disease severity classification, we performed multinomial logistic regression analyses. Disease severity class was the response variable, and age, gender and singleton count were the explanatory variables. When we consider the contribution of the singleton burden in the whole gene, the multinomial analyses showed that only age and gender are important predictors for the disease severity classification (Table S10, Figure S2). Using the burden of singletons in the coding regions as parameters in the regression model resulted in age and gender being significant predictors for being in class 2 vs. class 1 (p-values: 6.09 × and 0.04463 respectively) and for belonging to class 3 vs. class 1 (p-values: 6.42 × and 0.02035). Age was also a significant predictor of being in class 2 vs. class 3 (p-value: 0.008617). Regarding the contribution of the genes, the model highlighted only the FLNA gene as a significant predictor for being in class 2 vs. class 1 and for being in class 3 vs. class 2 (all p-values: < 2.2 × ) (Table S11). We also retrieved the disease outcome (Survived/Deceased) information and used the same parameters to perform a logistic regression analysis. As a result, we can still see that age is a major predictor for the outcome (p-value: 1.50 × 10⁻¹⁰) (Figure S3). We can also see a contribution of the ABCA7 gene, but only when we consider the number of singletons in the whole gene (p-value: 0.0228) (Tables S12 and S13).

4. Discussion

The ongoing COVID-19 pandemic is the most severe global emergency since the H1N1 influenza pandemic of 1918. For the scientific community, this emergency has been a wake-up call to join forces to fight back, investigate the effect of the virus on patients’ health, and understand the infection’s molecular mechanisms. All this ongoing effort is producing knowledge that is driving therapy and vaccine development. In this context, we focused on population-based statistics to characterize a subset of genes involved in the inflammation/immune response biological process. These statistics were obtained by analyzing populations with different ancestries and levels of inbreeding and consanguinity. We performed a comparison between our study cohorts and their matched reference populations, according to principal component analysis (PCA) results. The comparisons within our target populations allowed us to show the different patterns in genetic constraints for a large subset of genes involved in the immune response, leading to the prioritization of a group of genes that we could define as “the most differentiated” in terms of signatures of genetic constraints. These differences are most prominent for three genes (MUC5AC, ABCA7, FLNA), which harbor a pattern of relaxation in the QGP cohort with respect to the other cohorts analyzed. This different pattern of relaxation could hint at a different impact of these genes in different populations. Two of the genes, MUC5AC and FLNA, have already been linked to the COVID-19 host response to different degrees. The MUC5AC gene encodes a gel-forming mucin expressed in the lungs in response to infectious agents. This protein plays a protective role against inhaled pathogens, like influenza [38]. A recent study [39] compared levels of MUC5AC, MUC1 and MUC1-CT between critically ill COVID-19 patients and healthy controls, finding a significantly higher level of those proteins in the patients’ mucus.
It is also worth noting that another recent work from Kousathanas et al. reported a significant genome-wide association between variants in the MUC1 gene and critical illness caused by COVID-19 [9,39]. The second gene, FLNA, codes for the Filamin A protein, which has been identified as a putative interaction candidate with the coronavirus S protein and is involved in the coronavirus replication cycle [40]. A recent study showed that the FLNA gene is part of the host protein-protein interaction (PPI) network for the SARS-CoV-2 virus and is among the targets of different drugs under development [41]. A loss-of-function mutation of the FLNA gene was reported in adult family members with emphysema [42]. There is no study showing a direct link between the ABCA7 gene and COVID-19 yet, but it has been shown that the gene is highly expressed in the reticuloendothelial system and modulates phagocytosis activity [43,44], though its function, like that of many other ABC transporters, has yet to be clarified. Interestingly, the BioGRID interactome database [45] lists physical interactions of ABCA7 with ADBR2, C5AR2 and SGTB, among others. Each one of these genes has been linked to the COVID-19 host response in previous studies, in terms of interaction [46], therapy [47] and severity in case of pre-existing health conditions [48]. Combining the information available on the prioritized genes with the knowledge of the different evolution of the pandemic in Qatar and Italy, we performed a proof-of-concept analysis. We used the information provided by a cohort of COVID-19 positive samples from Italy to try to identify the contribution, if any, of the number of ultra-rare variants in those genes to the outcome of the disease (Survived/Deceased) and to its severity (Asymptomatic/Paucisymptomatic, Severe, Critical/life-threatening).
From a cohort-based perspective, we can see differences in the distribution of singletons in the COVID-19 positive samples compared with our study populations and their reference populations. This outcome suggests that those three genes could play a role in the description of the cohort and that investigating rare genetic variations occurring in those genetic regions could be a starting point to complete the characterization of those samples. In a subsequent step, we applied logistic regression analyses to investigate the impact of the singleton burden in the three prioritized genes on the disease outcome and disease severity. While the contribution of age and sex is explicit and expected, these analyses suggested that the burden of singletons carried by each patient in the ABCA7 gene could predict a worse outcome together with age. From the point of view of disease severity, the burden of singletons in the FLNA gene could help discriminate samples with distinct levels of disease severity. In this case, though, our findings seem to be inconsistent, since a lower burden of singletons is a predictor of developing a severe reaction, whereas a high singleton burden is a predictor of developing a critical reaction. This finding can be better explained by looking at the distribution of singletons in our cohort stratified by disease severity. For the FLNA gene, none of the samples belonging to severity class 2 carries a singleton. This feature could be introduced by one of the limitations of this study: the sample size of the COVID-19 positive cohort. Increasing the number of cases will undoubtedly allow us to have a better estimate of the singleton distribution. Moreover, in our model, we did not include any of the risk factors that are already linked to a diverse response to the infection.
One last limitation could be represented by the inclusion of only one cohort of COVID-19 positive samples, for which only whole-exome sequence data was available. We chose to include this cohort because of its phenotypic characterization, which allowed us to investigate our hypothesis of a genetic contribution of the prioritised genes to disease severity. Nevertheless, for all the cohorts involved, information on the COVID-19 affected samples is already being collected. That will allow us to produce more precise results with further analyses.
To our knowledge, this is the first study performing a whole-genome population-level comparison between Arabian and European populations, which were differently affected by the pandemic. Recent similar studies focused only on the ACE2 receptor and populations from the 1000 Genomes Project [49] or compared allele frequencies of COVID-19 related genes in the Brazilian population with data from the 1000 Genomes and gnomAD datasets [50]. With the development of new vaccines against SARS-CoV-2 infection, we are bound to see a decrease in adverse disease outcomes and disease severity among the immunized populations. However, our work could be a starting point to better prioritize genes that could be therapeutic targets in different populations. Moreover, with the increased knowledge obtained thanks to the many studies that focused on understanding virus-host interaction, we could extend our method to any similar new threat that may arise in the future.

5. Conclusions

We were able to identify three candidate genes that could be further investigated for their role in COVID-19 infection. We also want to stress that harnessing the information provided by rare genetic variants, in this still evolving context, is proving increasingly useful to explain the different outcomes of this disease.

43 in total

1. Qatar Biobank Cohort Study: Study Design and First Results.

Authors: Asma Al Thani; Eleni Fthenou; Spyridon Paparrodopoulos; Ajayeb Al Marri; Zumin Shi; Fatima Qafoud; Nahla Afifi
Journal: Am J Epidemiol Date: 2019-08-01 Impact factor: 4.897

2. The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic.

Authors:
Journal: Eur J Hum Genet Date: 2020-05-13 Impact factor: 4.246

3. Comparative genetic analysis of the novel coronavirus (2019-nCoV/SARS-CoV-2) receptor ACE2 in different populations.

Authors: Yanan Cao; Lin Li; Zhimin Feng; Shengqing Wan; Peide Huang; Xiaohui Sun; Fang Wen; Xuanlin Huang; Guang Ning; Weiqing Wang
Journal: Cell Discov Date: 2020-02-24 Impact factor: 10.849

4. Epidemiological investigation of the first 5685 cases of SARS-CoV-2 infection in Qatar, 28 February-18 April 2020.

Authors: Hanan M Al Kuwari; Hanan F Abdul Rahim; Laith J Abu-Raddad; Abdul-Badi Abou-Samra; Zaina Al Kanaani; Abdullatif Al Khal; Einas Al Kuwari; Salih Al Marri; Muna Al Masalmani; Hamad E Al Romaihi; Mohamed H Al Thani; Peter V Coyle; Ali N Latif; Robert Owen; Roberto Bertollini; Adeel Ajwad Butt
Journal: BMJ Open Date: 2020-10-07 Impact factor: 2.692

5. Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK.

Authors: Merryn Voysey; Sue Ann Costa Clemens; Shabir A Madhi; Lily Y Weckx; Pedro M Folegatti; Parvinder K Aley; Brian Angus; Vicky L Baillie; Shaun L Barnabas; Qasim E Bhorat; Sagida Bibi; Carmen Briner; Paola Cicconi; Andrea M Collins; Rachel Colin-Jones; Clare L Cutland; Thomas C Darton; Keertan Dheda; Christopher J A Duncan; Katherine R W Emary; Katie J Ewer; Lee Fairlie; Saul N Faust; Shuo Feng; Daniela M Ferreira; Adam Finn; Anna L Goodman; Catherine M Green; Christopher A Green; Paul T Heath; Catherine Hill; Helen Hill; Ian Hirsch; Susanne H C Hodgson; Alane Izu; Susan Jackson; Daniel Jenkin; Carina C D Joe; Simon Kerridge; Anthonet Koen; Gaurav Kwatra; Rajeka Lazarus; Alison M Lawrie; Alice Lelliott; Vincenzo Libri; Patrick J Lillie; Raburn Mallory; Ana V A Mendes; Eveline P Milan; Angela M Minassian; Alastair McGregor; Hazel Morrison; Yama F Mujadidi; Anusha Nana; Peter J O'Reilly; Sherman D Padayachee; Ana Pittella; Emma Plested; Katrina M Pollock; Maheshi N Ramasamy; Sarah Rhead; Alexandre V Schwarzbold; Nisha Singh; Andrew Smith; Rinn Song; Matthew D Snape; Eduardo Sprinz; Rebecca K Sutherland; Richard Tarrant; Emma C Thomson; M Estée Török; Mark Toshner; David P J Turner; Johan Vekemans; Tonya L Villafana; Marion E E Watson; Christopher J Williams; Alexander D Douglas; Adrian V S Hill; Teresa Lambe; Sarah C Gilbert; Andrew J Pollard
Journal: Lancet Date: 2020-12-08 Impact factor: 79.321

6. Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.

Authors: Lindsey R Baden; Hana M El Sahly; Brandon Essink; Karen Kotloff; Sharon Frey; Rick Novak; David Diemert; Stephen A Spector; Nadine Rouphael; C Buddy Creech; John McGettigan; Shishir Khetan; Nathan Segall; Joel Solis; Adam Brosz; Carlos Fierro; Howard Schwartz; Kathleen Neuzil; Larry Corey; Peter Gilbert; Holly Janes; Dean Follmann; Mary Marovich; John Mascola; Laura Polakowski; Julie Ledgerwood; Barney S Graham; Hamilton Bennett; Rolando Pajon; Conor Knightly; Brett Leav; Weiping Deng; Honghong Zhou; Shu Han; Melanie Ivarsson; Jacqueline Miller; Tal Zaks
Journal: N Engl J Med Date: 2020-12-30 Impact factor: 91.245

7. Characterizing the Qatar advanced-phase SARS-CoV-2 epidemic.

Authors: Laith J Abu-Raddad; Hiam Chemaitelly; Houssein H Ayoub; Zaina Al Kanaani; Abdullatif Al Khal; Einas Al Kuwari; Adeel A Butt; Peter Coyle; Andrew Jeremijenko; Anvar Hassan Kaleeckal; Ali Nizar Latif; Robert C Owen; Hanan F Abdul Rahim; Samya A Al Abdulla; Mohamed G Al Kuwari; Mujeeb C Kandy; Hatoun Saeb; Shazia Nadeem N Ahmed; Hamad Eid Al Romaihi; Devendra Bansal; Louise Dalton; Mohamed H Al-Thani; Roberto Bertollini
Journal: Sci Rep Date: 2021-03-18 Impact factor: 4.379

8. Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity.

Authors: Tõnu Esko; Massimo Mezzavilla; Mari Nelis; Christelle Borel; Tadeusz Debniak; Eveliina Jakkula; Antonio Julia; Sena Karachanak; Andrey Khrunin; Peter Kisfali; Veronika Krulisova; Zita Aušrelé Kučinskiené; Karola Rehnström; Michela Traglia; Liene Nikitina-Zake; Fritz Zimprich; Stylianos E Antonarakis; Xavier Estivill; Damjan Glavač; Ivo Gut; Janis Klovins; Michael Krawczak; Vaidutis Kučinskas; Mark Lathrop; Milan Macek; Sara Marsal; Thomas Meitinger; Béla Melegh; Svetlana Limborska; Jan Lubinski; Aarno Paolotie; Stefan Schreiber; Draga Toncheva; Daniela Toniolo; H-Erich Wichmann; Alexander Zimprich; Mait Metspalu; Paolo Gasparini; Andres Metspalu; Pio D'Adamo
Journal: Eur J Hum Genet Date: 2012-12-19 Impact factor: 4.246